[Swift-devel] Fwd: raptor-loop model problem

Michael Wilde wilde at mcs.anl.gov
Fri May 28 13:13:45 CDT 2010


Wenjun, in the attached log, I see 5 BoostThreader jobs starting but not finishing.

Then *I think* the coasters start timing out with nothing else to do.
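
If it helps, a quick way to confirm that is to tally start/finish markers per job id in the log. The patterns in this rough sketch are placeholders and would need to be matched to whatever the oops log actually prints when a job is submitted and when it completes or fails:

# count_jobs.py - tally which jobs start but never finish in a Swift log.
# The two patterns below are placeholders (hypothetical) and must be adjusted
# to the exact wording the attached log uses for job submission/completion.
import re
import sys

START_RE = re.compile(r"JOB_START.*jobid=(\S+)")  # placeholder pattern
END_RE = re.compile(r"JOB_END.*jobid=(\S+)")      # placeholder pattern

started, finished = set(), set()
with open(sys.argv[1]) as log:
    for line in log:
        m = START_RE.search(line)
        if m:
            started.add(m.group(1))
        m = END_RE.search(line)
        if m:
            finished.add(m.group(1))

print("started: ", len(started))
print("finished:", len(finished))
print("started but never finished:", sorted(started - finished))

(Run as: python count_jobs.py oops-20100527-1051-78q7905g.log)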

Mihael, can you take a look at this log and work with Wenjun to pinpoint the problem?

Thanks,

- Mike

----- Forwarded Message -----
From: "wenjun wu" <wwjag at mcs.anl.gov>
To: "Michael Wilde" <wilde at mcs.anl.gov>
Cc: "Thomas D. Uram" <turam at mcs.anl.gov>
Sent: Friday, May 28, 2010 10:48:40 AM GMT -06:00 US/Canada Central
Subject: Re: raptor-loop model problem

Hi Mike,
     After I fixed the File uploader in the portal, the old problem 
"cannot open seq file PREPROCESSED/SEQ/T0411D1.seq for read!" is gone.
     But the portal still can't get the results back from the BoostThreader 
job.
    The error in the workflow run log is:
2010-05-27 13:25:52,842-0500 WARN  RequestHandler org.globus.cog.karajan.workflow.service.channels.IrrecoverableException: Coaster service ended. Reason: null
     stdout:
     stderr:
     at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.statusChanged(ServiceManager.java:230)
     at org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
     at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
     at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:253)
     at org.globus.cog.abstraction.impl.ssh.execution.JobSubmissionTaskHandler.SSHTaskStatusChanged(JobSubmissionTaskHandler.java:193)
     at org.globus.cog.abstraction.impl.ssh.SSHRunner.notifyListeners(SSHRunner.java:84)
     at org.globus.cog.abstraction.impl.ssh.SSHRunner.run(SSHRunner.java:43)
     at java.lang.Thread.run(Thread.java:595)
2010-05-27 13:25:52,843-0500 INFO  AbstractStreamKarajanChannel 1427072207: Channel shut down
java.lang.Throwable
     at org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.close(AbstractTCPChannel.java:97)
     at org.globus.cog.karajan.workflow.service.channels.MetaChannel.close(MetaChannel.java:87)
     at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.statusChanged(ServiceManager.java:232)
     at org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236)
     at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224)
     at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:253)
     at org.globus.cog.abstraction.impl.ssh.execution.JobSubmissionTaskHandler.SSHTaskStatusChanged(JobSubmissionTaskHandler.java:193)
     at org.globus.cog.abstraction.impl.ssh.SSHRunner.notifyListeners(SSHRunner.java:84)
     at org.globus.cog.abstraction.impl.ssh.SSHRunner.run(SSHRunner.java:43)
     at java.lang.Thread.run(Thread.java:595)
2010-05-27 13:25:52,843-0500 INFO  ConnectionProtocol Freeing channel 4 [Unnamed Channel]
2010-05-27 13:25:58,866-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2010-05-27 13:26:08,883-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2010-05-27 13:26:18,905-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2010-05-27 13:26:28,920-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2010-05-27 13:26:38,924-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2010-05-27 13:26:48,934-0500 INFO  AbstractStreamKarajanChannel$Multiplexer No streams
2010-05-27 13:26:56,456-0500 INFO  TransportProtocolCommon Sending SSH_MSG_DISCONNECT
2010-05-27 13:26:56,456-0500 INFO  Service ssh-connection thread is exiting

It seems something went wrong after the BoostThreader job finished, so the 
Swift engine never gets the result back and ends up waiting forever.
In the coaster.log there are also some error messages:

2010-05-27 13:25:46,741-0500 WARN  Command Command(38, JOBSTATUS): handling reply timeout; sendReqTime=100527-132346.736, sendTime=100527-132346.737, now=100527-132546.741
2010-05-27 13:25:46,741-0500 WARN  Command Command(38, JOBSTATUS)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
         at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
         at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
         at java.util.TimerThread.mainLoop(Timer.java:512)
         at java.util.TimerThread.run(Timer.java:462)
2010-05-27 13:25:46,851-0500 WARN  Command Command(39, JOBSTATUS): handling reply timeout; sendReqTime=100527-132346.847, sendTime=100527-132346.848, now=100527-132546.851
2010-05-27 13:25:46,851-0500 WARN  Command Command(39, JOBSTATUS)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
         at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
         at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
         at java.util.TimerThread.mainLoop(Timer.java:512)
         at java.util.TimerThread.run(Timer.java:462)
2010-05-27 13:25:47,335-0500 WARN  Command Command(40, JOBSTATUS): handling reply timeout; sendReqTime=100527-132347.331, sendTime=100527-132347.332, now=100527-132547.335
2010-05-27 13:25:47,335-0500 WARN  Command Command(40, JOBSTATUS)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
         at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
         at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
         at java.util.TimerThread.mainLoop(Timer.java:512)
         at java.util.TimerThread.run(Timer.java:462)
2010-05-27 13:25:47,478-0500 INFO  Cpu 0527-011146-000000:3 pull
2010-05-27 13:25:47,804-0500 INFO  BlockQueueProcessor Updated allocsize: 6.519897787948322
2010-05-27 13:25:47,805-0500 INFO  BlockQueueProcessor allocsize = 6.519897787948322, queuedsize = 0.0, qsz = 0
2010-05-27 13:25:47,805-0500 INFO  BlockQueueProcessor Plan time: 1
2010-05-27 13:25:48,084-0500 WARN  Command Command(41, JOBSTATUS): handling reply timeout; sendReqTime=100527-132348.080, sendTime=100527-132348.081, now=100527-132548.084
2010-05-27 13:25:48,084-0500 WARN  Command Command(41, JOBSTATUS)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
         at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
         at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
         at java.util.TimerThread.mainLoop(Timer.java:512)
         at java.util.TimerThread.run(Timer.java:462)
2010-05-27 13:25:48,480-0500 INFO  Cpu 0527-011146-000000:0 pull
2010-05-27 13:25:49,482-0500 INFO  Cpu 0527-011146-000000:5 pull
2010-05-27 13:25:49,734-0500 WARN  Command Command(42, JOBSTATUS): handling reply timeout; sendReqTime=100527-132349.730, sendTime=100527-132349.731, now=100527-132549.734
2010-05-27 13:25:49,734-0500 WARN  Command Command(42, JOBSTATUS)fault was: Reply timeout
org.globus.cog.karajan.workflow.service.ReplyTimeoutException
         at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:269)
         at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:274)
         at java.util.TimerThread.mainLoop(Timer.java:512)
         at java.util.TimerThread.run(Timer.java:462)
2010-05-27 13:25:50,007-0500 INFO  Block Shutting down block Block 0527-011146-000000 (6x4200.000s)
2010-05-27 13:25:50,009-0500 INFO  BlockQueueProcessor Cleaned 1 done blocks
2010-05-27 13:25:50,009-0500 INFO  BlockQueueProcessor Updated allocsize: 6.519849641186395
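
For what it's worth, the sendTime/now pairs in those warnings are almost exactly 120 seconds apart (e.g. 132346.737 vs 132546.741 for Command(38)), which looks like a fixed two-minute reply timeout on the channel rather than anything specific to the jobs. A small sketch to check that across the whole coaster.log (the file and script names here are just illustrative):

# timeout_gap.py - measure how long each JOBSTATUS command waited before the
# "handling reply timeout" warning, using the sendTime/now fields shown above.
# Run as: python timeout_gap.py < coaster.log
import re
import sys
from datetime import datetime

STAMP = "%y%m%d-%H%M%S.%f"  # timestamps like 100527-132346.737
PAT = re.compile(r"Command\((\d+), JOBSTATUS\): handling reply timeout; "
                 r"sendReqTime=(\S+), sendTime=(\S+), now=(\S+)")

for line in sys.stdin:
    m = PAT.search(line)
    if not m:
        continue
    cmd, _reqtime, sent, now = m.groups()
    gap = datetime.strptime(now, STAMP) - datetime.strptime(sent, STAMP)
    print(f"Command({cmd}): {gap.total_seconds():.3f} s from send to timeout")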

The complete log is attached.

Wenjun
> Hi Mike,
>   My raptorloop script doesn't work from the portal, but it works when I 
> run it from the attached shell script.
> After digging through the swift and raptor log files,
> I found the error message: "cannot open seq file 
> PREPROCESSED/SEQ/T0411D1.seq for read!"
> BoostThreader is supposed to untar the prepared tarball to PREPROCESSED.
>   I have a feeling this was caused by the input arguments. From the 
> portal, the input arguments look like:
>
> -target=T0411D1 -maxLoopModel=1 -minLoopModelScore=1.0 -minLoopSize=3 
> -maxLoopsPerModel=10 -templateList=20
>  -prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz 
>
> -seqFile=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/1D8389428752F99E3C0A14789C07F55C/t0411d1.fasta 
> -templatesPerJob=4
> -nModels=10
>
>
>   And the successful raptor run has the following arguments:
>   -target=T0411D1 \
> -seqFile=/home/aashish/testPrep/T0411D1.fasta \
> -prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal/temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz 
> \
> -templatesPerJob=4 -templateList=20 -nModels=10 -nSim=4 \
> -loopRunParamFile=$(pwd)/loopmodels.param \
> -maxLoopModels=1 \
> -minLoopModelScore=1.0 \
> -minLoopSize=3 \
> -maxLoopsPerModel=10
>
>  Could you figure out any reason for this problem?
>
> Wenjun
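
One quick way to compare the two argument lists quoted above is to parse them into name/value pairs and diff them. The sketch below only contains the strings already quoted in this message (shell line continuations dropped; the script name is made up):

# compare_args.py - diff the two raptor argument lists quoted above
# (portal run vs. shell-script run, exactly as given in the message).
portal = """
-target=T0411D1 -maxLoopModel=1 -minLoopModelScore=1.0 -minLoopSize=3
-maxLoopsPerModel=10 -templateList=20
-prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz
-seqFile=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal//temp/1D8389428752F99E3C0A14789C07F55C/t0411d1.fasta
-templatesPerJob=4 -nModels=10
"""

script = """
-target=T0411D1 -seqFile=/home/aashish/testPrep/T0411D1.fasta
-prepTar=/gpfs/pads/oops/scienceportal/apache-tomcat-5.5.27/webapps/SIDGridPortal/temp/AE00A497C18DB8885C24D04862A0909A/t0411d1.prep.tar.gz
-templatesPerJob=4 -templateList=20 -nModels=10 -nSim=4
-loopRunParamFile=$(pwd)/loopmodels.param -maxLoopModels=1
-minLoopModelScore=1.0 -minLoopSize=3 -maxLoopsPerModel=10
"""

def parse(argstr):
    # turn "-name=value" tokens into a dict keyed by argument name
    return dict(tok.lstrip("-").split("=", 1) for tok in argstr.split())

p, s = parse(portal), parse(script)
print("only in the portal run: ", sorted(p.keys() - s.keys()))
print("only in the script run: ", sorted(s.keys() - p.keys()))
print("different values:")
for name in sorted(p.keys() & s.keys()):
    if p[name] != s[name]:
        print(f"  {name}: {p[name]!r} vs {s[name]!r}")

On these values it would flag -maxLoopModel vs -maxLoopModels as differently named, -nSim and -loopRunParamFile as present only in the script run, and the differing -seqFile and -prepTar paths (the portal prepTar path contains a double slash).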


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: oops-20100527-1051-78q7905g.log
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20100528/90a1e72d/attachment.ksh>

