[Swift-devel] Re: coaster error on ranger

Zhao Zhang zhaozhang at uchicago.edu
Thu Jun 11 11:06:48 CDT 2009


Hi, Mike

I am attaching the whole log at the end.
From the log, we can tell that no job had succeeded at the point when 
the workflow exited, and the workflow had been running for only 
13 minutes.

I also copied the swift-work dir back to the CI network; it is at 
/home/zzhang/see/logs/ampl-20090611-0122-hzktisu5.
Although no job in the workflow returned successfully, I did find 22 
result files in
/home/zzhang/see/logs/ampl-20090611-0122-hzktisu5/shared/result

You could take a look at run14 as an example. I echo the exit code of 
the ampl script at the end of run_ampl:
 ** EXIT - solution found.

Major Iterations. . . . 4
Minor Iterations. . . . 36
Restarts. . . . . . . . 0
Crash Iterations. . . . 0
Gradient Steps. . . . . 0
Function Evaluations. . 5
Gradient Evaluations. . 5
Basis Time. . . . . . . 25.713607
Total Time. . . . . . . 27.701732
Residual. . . . . . . . 2.998933e-07
Postsolved residual: 2.9989e-07
Path 4.7.01: Solution found.
4 iterations (0 for crash); 36 pivots.
5 function, 5 gradient evaluations.
exitcode 2

See here? The exit code is 2, which means the ampl script itself has an 
error. I know you said Todd has a fix for this,
but I couldn't find it. The code I was running is the latest from svn. Any 
idea about this?
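
To check all 22 result files rather than just run14, something like the following could pull out the trailing exit code. This is only a minimal sketch: it assumes every result file ends with an "exitcode N" line like the one above, and the function name last_exit_code is hypothetical, not part of run_ampl or Swift.

```python
import re

def last_exit_code(text):
    """Return the integer from the last 'exitcode N' line, or None if absent."""
    # Assumption: run_ampl echoes the code on its own line as 'exitcode N'.
    matches = re.findall(r"^exitcode\s+(-?\d+)\s*$", text, flags=re.MULTILINE)
    return int(matches[-1]) if matches else None

# Tail of the run14 output quoted above.
sample = """Path 4.7.01: Solution found.
4 iterations (0 for crash); 36 pivots.
5 function, 5 gradient evaluations.
exitcode 2
"""

print(last_exit_code(sample))
```

Running it over each file under shared/result would show whether every job that produced output also ended with a non-zero code.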

best wishes
zhao


Swift svn swift-r2953 cog-r2406

RunID: 20090611-0122-hzktisu5
Progress:
Progress:  uninitialized:1
Progress:  Selecting site:98  Initializing site shared directory:1  Stage in:1
Progress:  Stage in:99  Submitting:1
Progress:  Submitting:99  Submitted:1
Progress:  Submitted:100
Progress:  Submitted:100
Progress:  Submitted:100
Progress:  Submitted:100
Progress:  Submitted:99  Active:1
Progress:  Submitted:82  Active:18
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:23
Progress:  Submitted:77  Active:22 Failed but can retry:1
Failed to transfer wrapper log from ampl-20090611-0122-hzktisu5/info/h on tgtacc
Failed to transfer wrapper log from ampl-20090611-0122-hzktisu5/info/s on tgtacc
Progress:  Submitted:75  Active:21  Failed:1 Failed but can retry:3
Execution failed:
        Exception in run_ampl:
Arguments: [run70, template, armington.mod, armington_process.cmd, 
armington_output.cmd, subproblems/producer_tree.mod, ces.so]
Host: tgtacc
Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
stderr.txt:

stdout.txt:

----

Caused by:
        Shutting down worker
Cleaning up...
Shutting down service at https://129.114.50.163:58556
Got channel MetaChannel: 6217586 -> GSSSChannel-null(1)
Failed to transfer wrapper log from ampl-20090611-0122-hzktisu5/info/o on tgtacc
- Done
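
For reading the Progress lines above, here is a rough sketch of how each one could be split into per-state counts. It assumes the "Progress:  State:N  State:N" layout seen in this log and expects each Progress entry on a single line; parse_progress is a hypothetical helper, not a Swift tool.

```python
import re

def parse_progress(line):
    """Split one 'Progress:' line into a {state: count} dict, else None."""
    if not line.startswith("Progress:"):
        return None
    body = line[len("Progress:"):]
    # State names may contain spaces ("Failed but can retry"), so capture
    # everything up to the colon that precedes each count.
    pairs = re.findall(r"([A-Za-z ]+?):(\d+)", body)
    return {name.strip(): int(count) for name, count in pairs}

# Last Progress line before the run failed.
final = parse_progress(
    "Progress:  Submitted:75  Active:21  Failed:1 Failed but can retry:3"
)
print(final)
print(sum(final.values()))
```

On that last line the counts add up to the 100 jobs in the run, with none in a finished state.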

Michael Wilde wrote:
> There is some likelihood that ampl itself is exiting with a non-zero 
> exit code (12 I suspect) due to a subscript error at the near-correct 
> termination of the model (i.e. it runs usefully to the end, then dies 
> when it runs off the end of an array).  We know the fix for this.
>
> But I wonder, in the case below, Zhao: is this happening when ampl 
> gets one of these errors, or is it running one job OK on a coaster, 
> and then running into a timeout on the next job?
>
> What was the mapping of the number of jobs in this script (100 I 
> think) to the number of coasters started? Did the error occur when it 
> tried to start a second long job on a coaster after a prior (long) job 
> had already completed?
>
> - Mike
>
>
> On 6/11/09 10:22 AM, Mihael Hategan wrote:
>> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
>>> Hi, Mike and Mihael
>>>
>>> Here is the error. I think it is related to the job wall time in 
>>> the coaster settings.
>>>
>>> Mihael, could you give me some suggestions on how to set the 
>>> parameters for coasters on ranger?
>>
>> I need to know what the problem is first. And for that I need to take a
>> look at the coaster log (and possibly gram logs). So if you could copy
>> that to some shared space in the CI, that would be good.
>>
>>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
>>>
>>> best
>>> zhao
>>>
>>> Execution failed:
>>>         Exception in run_ampl:
>>> Arguments: [run70, template, armington.mod, armington_process.cmd, 
>>> armington_output.cmd, subproblems/producer_tree.mod, ces.so]
>>> Host: tgtacc
>>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
>>> stderr.txt:
>>>
>>> stdout.txt:
>>> ----
>>>
>>> Caused by:
>>>         Shutting down worker
>>> Cleaning up...
>>> Shutting down service at https://129.114.50.163:58556
>>>
>>> And here is my sites.xml
>>> bash-3.00$ cat tgranger-sge-gram2.xml
>>> <config>
>>>   <pool handle="tgtacc" >
>>>     <gridftp  url="gsiftp://gridftp.ranger.tacc.teragrid.org" />
>>>     <execution  provider="coaster" 
>>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>>     <!-- <profile namespace="globus" 
>>> key="project">TG-DBS080004N</profile> -->
>>>     <profile namespace="globus" key="project">TG-CCR080022N</profile>
>>>     <workdirectory >/work/00946/zzhang/work</workdirectory>
>>>     <profile namespace="env" 
>>> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir</profile>
>>>     <profile namespace="globus" key="coastersPerNode">16</profile>
>>>     <profile namespace="globus" key="queue">development</profile>
>>>     <profile namespace="karajan" key="initialScore">100</profile>
>>>     <profile namespace="karajan" key="jobThrottle">10</profile>
>>>     <profile namespace="globus" key="slots">20</profile>
>>>     <profile namespace="globus" key="lowOverAllocation">5</profile>
>>>     <profile namespace="globus" key="highOverAllocation">1</profile>
>>>     <profile namespace="globus" key="maxNodes">5</profile>
>>>   </pool>
>>> </config>
>>>
>>>
>>>
>>>
>>
>


