[Swift-devel] Re: coaster error on ranger
Zhao Zhang
zhaozhang at uchicago.edu
Thu Jun 11 13:13:34 CDT 2009
Hi, Mihael
Actually, I have no idea how long these jobs will run. Some of them
took only ~10 minutes, and some ran far longer than that.
If I set the wall time to 120 minutes, what happens when the wall
time is up but the job hasn't finished?
<profile namespace="globus" key="maxtime">120</profile>
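If the limit is meant to be attached to the application rather than to the site, one possible shape is a per-transformation profile in the last column of tc.data. This is only a sketch: it assumes the Swift `GLOBUS::maxwalltime` profile key with an hours:minutes:seconds value, which has not been confirmed in this thread, and the two-hour value simply mirrors the 120 minutes above.

```
# sketch only: assumes the GLOBUS::maxwalltime profile key; fields must be tab-separated
tgtacc	run_ampl	/share/home/00946/zzhang/SEE-work/static/run_ampl	INSTALLED	INTEL32::LINUX	GLOBUS::maxwalltime="02:00:00"
```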
zhao
Mihael Hategan wrote:
> On Thu, 2009-06-11 at 13:04 -0500, Zhao Zhang wrote:
>
>> No, I don't specify any wall time.
>>
>
> Well, you need to specify one.
>
>
>> The last entry is for the run_ampl script.
>>
>> zhao
>>
>> login3% cat tc.data
>> #This is the transformation catalog.
>> #
>> #It comes pre-configured with a number of simple transformations with
>> #paths that are likely to work on a linux box. However, on some systems,
>> #the paths to these executables will be different (for example, sometimes
>> #some of these programs are found in /usr/bin rather than in /bin)
>> #
>> #NOTE WELL: fields in this file must be separated by tabs, not spaces; and
>> #there must be no trailing whitespace at the end of each line.
>> #
>> # sitename transformation path INSTALLED platform profiles
>> bgps echo /bin/echo INSTALLED INTEL32::LINUX null
>> bgp000 cat /bin/cat INSTALLED INTEL32::LINUX null
>> localhost sleep /bin/sleep INSTALLED INTEL32::LINUX null
>> localhost echo /bin/echo INSTALLED INTEL32::LINUX null
>> localhost ls /bin/ls INSTALLED INTEL32::LINUX null
>> localhost wc /bin/wc INSTALLED INTEL32::LINUX null
>> localhost grep /bin/grep INSTALLED INTEL32::LINUX null
>> localhost sort /bin/sort INSTALLED INTEL32::LINUX null
>> localhost paste /bin/paste INSTALLED INTEL32::LINUX null
>> localhost date /bin/date INSTALLED INTEL32::LINUX null
>> localhost db /home/wilde/angle/data/db INSTALLED INTEL32::LINUX null
>> localhost set1 /home/wilde/angle/data/set1 INSTALLED INTEL32::LINUX null
>> localhost set3 /home/wilde/angle/data/set3 INSTALLED INTEL32::LINUX null
>> localhost run_ampl /share/home/00946/zzhang/SEE-work/static/run_ampl INSTALLED INTEL32::LINUX null
>> tgtacc run_ampl /share/home/00946/zzhang/SEE-work/static/run_ampl INSTALLED INTEL32::LINUX null
>>
>>
>> Mihael Hategan wrote:
>>
>>> Your jobs seem to not have a walltime specified. Can you post your
>>> tc.data?
>>>
>>> On Thu, 2009-06-11 at 10:37 -0500, Zhao Zhang wrote:
>>>
>>>
>>>> Hi, Mihael
>>>>
>>>> The coaster log is at /home/zzhang/see/logs/coasters.log. The latest
>>>> record should be the run that failed last night.
>>>>
>>>> best
>>>> zhao
>>>>
>>>> Mihael Hategan wrote:
>>>>
>>>>
>>>>> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi, Mike and Mihael
>>>>>>
>>>>>> Here is the error. I think it is related to the job wall time in the
>>>>>> coaster settings.
>>>>>>
>>>>>> Mihael, could you give me some suggestions on how to set the parameters
>>>>>> for coasters on ranger?
>>>>>>
>>>>>>
>>>>>>
>>>>> I need to know what the problem is first. And for that I need to take a
>>>>> look at the coaster log (and possibly gram logs). So if you could copy
>>>>> that to some shared space in the CI, that would be good.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> For now I am running 100 jobs; each job could take 2~3 hours. Thanks.
>>>>>>
>>>>>> best
>>>>>> zhao
>>>>>>
>>>>>> Execution failed:
>>>>>> Exception in run_ampl:
>>>>>> Arguments: [run70, template, armington.mod, armington_process.cmd,
>>>>>> armington_output.cmd, subproblems/producer_tree.mod, ces.so]
>>>>>> Host: tgtacc
>>>>>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
>>>>>> stderr.txt:
>>>>>>
>>>>>> stdout.txt:
>>>>>> ----
>>>>>>
>>>>>> Caused by:
>>>>>> Shutting down worker
>>>>>> Cleaning up...
>>>>>> Shutting down service at https://129.114.50.163:58556
>>>>>>
>>>>>> And here is my sites.xml
>>>>>> bash-3.00$ cat tgranger-sge-gram2.xml
>>>>>> <config>
>>>>>> <pool handle="tgtacc" >
>>>>>> <gridftp url="gsiftp://gridftp.ranger.tacc.teragrid.org" />
>>>>>> <execution provider="coaster" url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>>>>> <!-- <profile namespace="globus" key="project">TG-DBS080004N</profile> -->
>>>>>> <profile namespace="globus" key="project">TG-CCR080022N</profile>
>>>>>> <workdirectory >/work/00946/zzhang/work</workdirectory>
>>>>>> <profile namespace="env" key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir</profile>
>>>>>> <profile namespace="globus" key="coastersPerNode">16</profile>
>>>>>> <profile namespace="globus" key="queue">development</profile>
>>>>>> <profile namespace="karajan" key="initialScore">100</profile>
>>>>>> <profile namespace="karajan" key="jobThrottle">10</profile>
>>>>>> <profile namespace="globus" key="slots">20</profile>
>>>>>> <profile namespace="globus" key="lowOverAllocation">5</profile>
>>>>>> <profile namespace="globus" key="highOverAllocation">1</profile>
>>>>>> <profile namespace="globus" key="maxNodes">5</profile>
>>>>>> </pool>
>>>>>> </config>
>>>>>>