[Swift-devel] Re: coaster error on ranger

Michael Wilde wilde at mcs.anl.gov
Thu Jun 11 10:29:35 CDT 2009


There is some likelihood that ampl itself is exiting with a non-zero 
exit code (12, I suspect) due to a subscript error at the near-correct 
termination of the model (i.e., it runs usefully to the end, then dies when 
it runs off the end of an array).  We know the fix for this.
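
One way to confirm the exit code would be to wrap the ampl call so its 
status gets recorded. A minimal sketch, assuming Python is available 
where the job runs; run_and_report is a hypothetical helper, not part of 
Swift or of the run_ampl wrapper:

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd (a list of arguments), log its exit status, and return it."""
    result = subprocess.run(cmd)
    # Write to stderr so the status shows up in the job's stderr.txt.
    print(f"exit code of {cmd[0]}: {result.returncode}", file=sys.stderr)
    return result.returncode
```

Calling it with the real ampl command line and finding a non-zero status 
on a run that otherwise produced its output would point at the subscript 
error rather than at the coaster.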

But I wonder, in the case below, Zhao: is this happening when ampl gets 
one of these errors, or is it running one job OK on a coaster, and then 
running into a timeout on the next job?

What was the mapping of the number of jobs in this script (100 I think) 
to the number of coasters started? Did the error occur when it tried to 
start a second long job on a coaster after a prior (long) job had 
already completed?
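
To make the second question concrete, here is a rough sketch of the 
arithmetic, assuming each coaster worker lives for a fixed walltime and 
runs jobs back to back. The 4-hour walltime is purely an illustrative 
assumption (it is not taken from the sites file or from the queue 
limits); the 100 jobs and the 3-hour job length are from Zhao's message. 
Both function names are hypothetical:

```python
def jobs_per_worker(walltime_min, job_min):
    """Number of whole jobs that fit back to back in one worker's walltime."""
    return walltime_min // job_min


def workers_needed(total_jobs, per_worker):
    """Worker slots needed if each one can complete per_worker jobs."""
    if per_worker <= 0:
        raise ValueError("a single job does not fit in the worker walltime")
    # Ceiling division without floating point.
    return -(-total_jobs // per_worker)


# Assumed 4-hour worker walltime, 3-hour jobs, 100 jobs in the script.
fit = jobs_per_worker(4 * 60, 3 * 60)
slots = workers_needed(100, fit)
print(fit, slots)
```

Under those assumptions only one job fits per worker, so a second long 
job started on the same coaster would be cut off when the worker's 
walltime expires.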

- Mike


On 6/11/09 10:22 AM, Mihael Hategan wrote:
> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
>> Hi, Mike and Mihael
>>
>> Here is the error, I think this is related to the job wall time of 
>> coaster settings.
>>
>> Mihael, could you give me some suggestions on how to set the parameters 
>> for coasters on ranger?
> 
> I need to know what the problem is first. And for that I need to take a
> look at the coaster log (and possibly gram logs). So if you could copy
> that to some shared space in the CI, that would be good.
> 
>> For now I am running 100 jobs, each job could take 2~3 hours. Thanks.
>>
>> best
>> zhao
>>
>> Execution failed:
>>         Exception in run_ampl:
>> Arguments: [run70, template, armington.mod, armington_process.cmd, 
>> armington_ou\
>> tput.cmd, subproblems/producer_tree.mod, ces.so]
>> Host: tgtacc
>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
>> stderr.txt:
>>
>> stdout.txt:
>> ----
>>
>> Caused by:
>>         Shutting down worker
>> Cleaning up...
>> Shutting down service at https://129.114.50.163:58556
>>
>> And here is my sites.xml
>> bash-3.00$ cat tgranger-sge-gram2.xml
>> <config>
>>   <pool handle="tgtacc" >
>>     <gridftp  url="gsiftp://gridftp.ranger.tacc.teragrid.org" />
>>     <execution  provider="coaster" 
>> url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>     <!-- <profile namespace="globus" 
>> key="project">TG-DBS080004N</profile> -->
>>     <profile namespace="globus" key="project">TG-CCR080022N</profile>
>>     <workdirectory >/work/00946/zzhang/work</workdirectory>
>>     <profile namespace="env" 
>> key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir</profile>
>>     <profile namespace="globus" key="coastersPerNode">16</profile>
>>     <profile namespace="globus" key="queue">development</profile>
>>     <profile namespace="karajan" key="initialScore">100</profile>
>>     <profile namespace="karajan" key="jobThrottle">10</profile>
>>     <profile namespace="globus" key="slots">20</profile>
>>     <profile namespace="globus" key="lowOverAllocation">5</profile>
>>     <profile namespace="globus" key="highOverAllocation">1</profile>
>>     <profile namespace="globus" key="maxNodes">5</profile>
>>   </pool>
>> </config>
>>
>>
>>
>>
> 


