[Swift-devel] Re: coaster error on ranger

Zhao Zhang zhaozhang at uchicago.edu
Thu Jun 11 13:13:34 CDT 2009


Hi, Mihael

Actually, I have no idea how long these jobs will run. Some of them 
took only ~10 minutes, and some ran far longer than that.
If I set the wall time to 120 minutes, what happens when the wall 
time is up but the job hasn't finished?
    <profile namespace="globus" key="maxtime">120</profile>
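
To be concrete, this is roughly where I would put that line in my 
sites.xml (quoted further down). It is only a sketch: I am assuming the 
key goes in the pool entry next to the other globus profiles, and that 
the value is read in minutes, which I have not confirmed.

```xml
<pool handle="tgtacc">
  <!-- existing gridftp, execution, and other profile entries unchanged -->
  <profile namespace="globus" key="project">TG-CCR080022N</profile>
  <profile namespace="globus" key="queue">development</profile>
  <!-- assumption: maxtime is interpreted in minutes, so 120 would mean 2 hours -->
  <profile namespace="globus" key="maxtime">120</profile>
</pool>
```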


zhao

Mihael Hategan wrote:
> On Thu, 2009-06-11 at 13:04 -0500, Zhao Zhang wrote:
>   
>> No, I don't specify any wall time.
>>     
>
> Well, you need to specify one.
>
>   
>> The last entry is for the run_ampl script.
>>
>> zhao
>>
>> login3% cat tc.data
>> #This is the transformation catalog.
>> #
>> #It comes pre-configured with a number of simple transformations with
>> #paths that are likely to work on a linux box. However, on some systems,
>> #the paths to these executables will be different (for example, sometimes
>> #some of these programs are found in /usr/bin rather than in /bin)
>> #
>> #NOTE WELL: fields in this file must be separated by tabs, not spaces; and
>> #there must be no trailing whitespace at the end of each line.
>> #
>> # sitename  transformation  path   INSTALLED  platform  profiles
>> bgps    echo            /bin/echo       INSTALLED       INTEL32::LINUX  null
>> bgp000  cat             /bin/cat        INSTALLED       INTEL32::LINUX  null
>> localhost       sleep           /bin/sleep      INSTALLED       INTEL32::LINUX  null
>> localhost       echo            /bin/echo       INSTALLED       INTEL32::LINUX  null
>> localhost       ls              /bin/ls         INSTALLED       INTEL32::LINUX  null
>> localhost       wc              /bin/wc         INSTALLED       INTEL32::LINUX  null
>> localhost       grep            /bin/grep       INSTALLED       INTEL32::LINUX  null
>> localhost       sort            /bin/sort       INSTALLED       INTEL32::LINUX  null
>> localhost       paste           /bin/paste      INSTALLED       INTEL32::LINUX  null
>> localhost       date            /bin/date       INSTALLED       INTEL32::LINUX  null
>> localhost       db              /home/wilde/angle/data/db       INSTALLED       INTEL32::LINUX  null
>> localhost       set1            /home/wilde/angle/data/set1     INSTALLED       INTEL32::LINUX  null
>> localhost       set3            /home/wilde/angle/data/set3     INSTALLED       INTEL32::LINUX  null
>> localhost       run_ampl        /share/home/00946/zzhang/SEE-work/static/run_ampl   INSTALLED       INTEL32::LINUX  null
>> tgtacc          run_ampl        /share/home/00946/zzhang/SEE-work/static/run_ampl   INSTALLED       INTEL32::LINUX  null
>>
>>
>> Mihael Hategan wrote:
>>     
>>> Your jobs seem to not have a walltime specified. Can you post your
>>> tc.data?
>>>
>>> On Thu, 2009-06-11 at 10:37 -0500, Zhao Zhang wrote:
>>>   
>>>       
>>>> Hi, Mihael
>>>>
>>>> The coaster log is at /home/zzhang/see/logs/coasters.log. The latest 
>>>> record should be the run that failed last night.
>>>>
>>>> best
>>>> zhao
>>>>
>>>> Mihael Hategan wrote:
>>>>     
>>>>         
>>>>> On Thu, 2009-06-11 at 09:24 -0500, Zhao Zhang wrote:
>>>>>   
>>>>>       
>>>>>           
>>>>>> Hi, Mike and Mihael
>>>>>>
>>>>>> Here is the error. I think it is related to the job wall time in the 
>>>>>> coaster settings.
>>>>>>
>>>>>> Mihael, could you give me some suggestions on how to set the parameters 
>>>>>> for coasters on ranger?
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> I need to know what the problem is first. And for that I need to take a
>>>>> look at the coaster log (and possibly gram logs). So if you could copy
>>>>> that to some shared space in the CI, that would be good.
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>>>> For now I am running 100 jobs; each job could take 2~3 hours. Thanks.
>>>>>>
>>>>>> best
>>>>>> zhao
>>>>>>
>>>>>> Execution failed:
>>>>>>         Exception in run_ampl:
>>>>>> Arguments: [run70, template, armington.mod, armington_process.cmd, armington_output.cmd, subproblems/producer_tree.mod, ces.so]
>>>>>> Host: tgtacc
>>>>>> Directory: ampl-20090611-0122-hzktisu5/jobs/h/run_ampl-h92ap3cj
>>>>>> stderr.txt:
>>>>>>
>>>>>> stdout.txt:
>>>>>> ----
>>>>>>
>>>>>> Caused by:
>>>>>>         Shutting down worker
>>>>>> Cleaning up...
>>>>>> Shutting down service at https://129.114.50.163:58556
>>>>>>
>>>>>> And here is my sites.xml
>>>>>> bash-3.00$ cat tgranger-sge-gram2.xml
>>>>>> <config>
>>>>>>   <pool handle="tgtacc" >
>>>>>>     <gridftp  url="gsiftp://gridftp.ranger.tacc.teragrid.org" />
>>>>>>     <execution  provider="coaster" url="gatekeeper.ranger.tacc.teragrid.org" jobManager="gt2:gt2:SGE"/>
>>>>>>     <!-- <profile namespace="globus" key="project">TG-DBS080004N</profile> -->
>>>>>>     <profile namespace="globus" key="project">TG-CCR080022N</profile>
>>>>>>     <workdirectory >/work/00946/zzhang/work</workdirectory>
>>>>>>     <profile namespace="env" key="SWIFT_JOBDIR_PATH">/tmp/zzhang/jobdir</profile>
>>>>>>     <profile namespace="globus" key="coastersPerNode">16</profile>
>>>>>>     <profile namespace="globus" key="queue">development</profile>
>>>>>>     <profile namespace="karajan" key="initialScore">100</profile>
>>>>>>     <profile namespace="karajan" key="jobThrottle">10</profile>
>>>>>>     <profile namespace="globus" key="slots">20</profile>
>>>>>>     <profile namespace="globus" key="lowOverAllocation">5</profile>
>>>>>>     <profile namespace="globus" key="highOverAllocation">1</profile>
>>>>>>     <profile namespace="globus" key="maxNodes">5</profile>
>>>>>>   </pool>
>>>>>> </config>


