[Swift-devel] WORKER_INIT_CMD - with log file

Emalayan Vairavanathan svemalayan at yahoo.com
Tue Feb 28 23:44:08 CST 2012


Hi All,

Today I tried to run the 001-catsn-surveyor.swift script with Mike's help, but I am still facing some issues, and it would be great if you could shed some light on them.

Brief overview of what I am doing:
I am trying to run a simple Swift script (001-catsn-surveyor.swift) with persistent-coasters. The goal is to try out more complex applications such as Montage and ModFTDock, and ultimately to integrate MosaStore+Swift+Applications.


Current problem: The workers cannot connect to the coaster service when running 001-catsn-surveyor.swift.

Steps taken:

I followed the steps below on Surveyor, using the Swift build available in ~wozniak/Public/swift/bin/swift.


1) Set the port-numbers, IP address and Nodes in coaster-service.conf  

   (LOCAL_PORT=22346, SERVICE_PORT=22356, IPADDR=172.17.3.12 and NODES=3)

2) Set environment variable GLOBUS_HOSTNAME=172.17.3.12

3) Launched coaster-service from 172.17.3.12 (inside ~/emalayan/app/swift-test folder)


4) Launched 001-catsn-surveyor.swift using run.sh from 172.17.3.12 (inside ~/emalayan/app/swift-test folder)
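
For reference, steps 1 and 2 above can be sketched roughly as follows. This is only an illustration: it assumes that coaster-service.conf accepts shell-style KEY=VALUE lines (the form the settings are written in above), and it writes to a temporary directory rather than to the real swift-test folder. Steps 3 and 4 (launching the coaster service and run.sh) are not reproduced here.

```shell
#!/bin/sh
# Sketch of steps 1 and 2. Assumption: coaster-service.conf is a file of
# shell-style KEY=VALUE lines. A temporary directory stands in for the
# real swift-test folder.
conf_dir=$(mktemp -d)
conf="$conf_dir/coaster-service.conf"

# Step 1: ports, IP address and node count, as listed in the message.
cat > "$conf" <<'EOF'
LOCAL_PORT=22346
SERVICE_PORT=22356
IPADDR=172.17.3.12
NODES=3
EOF

# Step 2: export GLOBUS_HOSTNAME with the same address as IPADDR.
. "$conf"
export GLOBUS_HOSTNAME="$IPADDR"

echo "service=$IPADDR:$SERVICE_PORT local=$IPADDR:$LOCAL_PORT nodes=$NODES"
echo "GLOBUS_HOSTNAME=$GLOBUS_HOSTNAME"
```

Reading the address back from the file keeps IPADDR and GLOBUS_HOSTNAME from drifting apart when the settings are edited.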

However, 001-catsn-surveyor.swift did not make any progress. Using cqstat, I observed that the nodes were allocated and running. When I checked the worker log files, I found that the workers were unable to connect to the coaster service.

Then I tried to connect to the compute nodes to see whether the workers were actually running there, but I could not connect to the compute nodes from the I/O node.
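
A quick way to narrow down this kind of failure is to probe the service ports from the host where the connection attempt originates. The sketch below is an assumption-laden illustration, not part of Swift: port_open is a hypothetical helper, it relies on bash (for /dev/tcp) and the coreutils timeout command being installed, and it probes both configured ports because the message does not say which one the workers use.

```shell
#!/bin/sh
# Sketch: report whether a TCP port accepts connections from this host.
# port_open is a hypothetical helper, not a Swift or coaster command.
# Assumes bash (for /dev/tcp) and coreutils `timeout` are available.
port_open() {
  # $1 = host, $2 = port; succeeds only if the connection is accepted
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

host=172.17.3.12          # IPADDR from coaster-service.conf
for port in 22346 22356; do   # LOCAL_PORT and SERVICE_PORT
  if port_open "$host" "$port"; then
    echo "$host:$port reachable"
  else
    echo "$host:$port NOT reachable"
  fi
done
```

A refused or timed-out probe from a node that should reach the service would point at addressing or routing rather than at the Swift script itself.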

I repeated the same steps with NODES=64 and NODES=1024 to see whether this problem (the inability to connect to the coaster service) depends on the NODES setting in coaster-service.conf (which was initially 3), but I observed the same behavior.


To find out whether this is caused by a network configuration issue on Surveyor, I ran ModFTDock+Swift (available in ~/emalayan/app/forEmalayan_ccGrdid) with coasters. It ran successfully, and I was able to connect to the compute nodes without any issues during the application run.


You can find the 001-catsn-surveyor.swift script, config files and log files in the ~/emalayan/app/swift-test and ~/emalayan/app/swift-test/log folders.

I would highly appreciate your input. Please let me know if you have questions.

Thank you
Emalayan



From: Emalayan Vairavanathan <svemalayan at yahoo.com>
To: Justin M Wozniak <wozniak at mcs.anl.gov> 
Cc: Jonathan Monette <jonmon at mcs.anl.gov>; swift user <swift-user at ci.uchicago.edu>; matei <matei at ece.ubc.ca> 
Sent: Thursday, 23 February 2012 4:28 PM
Subject: Re: WORKER_INIT_CMD - with log file
 

Thank you. I was not aware of that.

Now I am getting the error below. Am I missing some configuration?

Thank you
Emalayan


Swift trunk swift-r5662 (swift modified locally) cog-r3361 (cog modified locally)

RunID: 20120224-0021-jtfvfc90
Progress:  time: Fri, 24 Feb 2012 00:21:32 +0000
Find: http://172.17.3.12:12346
Find:  keepalive(120), reconnect - http://172.17.3.12:12346
Failed to transfer wrapper log for job cat-leiykink
Failed to transfer wrapper log for job cat-qeiykink
Failed to transfer wrapper log for job cat-jeiykink
Failed to transfer wrapper log for job cat-meiykink
Failed to transfer wrapper log for job cat-oeiykink
Failed to transfer wrapper log for job cat-peiykink
Failed to transfer wrapper log for job cat-keiykink
Failed to transfer wrapper log for job cat-seiykink
Failed to transfer wrapper log for job cat-neiykink
Progress:  time: Fri, 24 Feb 2012 00:21:33 +0000  Stage in:3  Submitting:5 Failed but can retry:2
Failed to transfer wrapper log for job cat-reiykink
Failed to transfer wrapper log for job cat-3fiykink
Failed to transfer wrapper log for job cat-7fiykink
Failed to transfer wrapper log for job cat-8fiykink
Failed to transfer wrapper log for job cat-bfiykink
Failed to transfer wrapper log for job cat-zeiykink
Failed to transfer wrapper log for job cat-0fiykink
Failed to transfer wrapper log for job cat-afiykink
Failed to transfer wrapper log for job cat-yeiykink
Failed to transfer wrapper log for job cat-1fiykink
Failed to transfer wrapper log for job cat-dfiykink
Failed to transfer wrapper log for job cat-ffiykink
EXCEPTION Exception in cat:
Arguments: [data.txt]
Host: persistent-coasters
Directory: 001-catsn-surveyor-20120224-0021-jtfvfc90/jobs/f/cat-ffiykink
stderr.txt: 

stdout.txt: 

----

Caused by: Task failed: null
org.globus.cog.karajan.workflow.service.channels.ChannelException: Channel died and no contact available
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.connect(ChannelManager.java:234)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:256)
    at org.globus.cog.karajan.workflow.service.channels.ChannelManager.reserveChannel(ChannelManager.java:226)
    at org.globus.cog.abstraction.coaster.service.job.manager.Node.getChannel(Node.java:132)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.submit(Cpu.java:258)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launchSequential(Cpu.java:213)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.launch(Cpu.java:199)
    at org.globus.cog.abstraction.coaster.service.job.manager.Cpu.pull(Cpu.java:169)
    at org.globus.cog.abstraction.coaster.service.job.manager.PullThread.run(PullThread.java:114)


Execution failed:
    Failed to transfer wrapper log for job cat-mfiykink
EXCEPTION Exception in cat:
Arguments: [data.txt]
Host: persistent-coasters
Directory: 001-catsn-surveyor-20120224-0021-jtfvfc90/jobs/m/cat-mfiykink
stderr.txt: 

stdout.txt: 

----



________________________________
From: Justin M Wozniak <wozniak at mcs.anl.gov>
To: Emalayan Vairavanathan <svemalayan at yahoo.com> 
Cc: Jonathan Monette <jonmon at mcs.anl.gov>; swift user <swift-user at ci.uchicago.edu>; matei <matei at ece.ubc.ca> 
Sent: Thursday, 23 February 2012 4:08 PM
Subject: Re: WORKER_INIT_CMD - with log file
 

Do the pool names agree?  You may want to check that you are using the 
right tc file.  You might need tc.persistent.data .
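
One way to check whether the pool names agree is sketched below. It rests on two assumptions that should be verified against the actual files: that the first two whitespace-separated columns of the tc file are the pool name and the application name, and that the sites file declares pools as <pool handle="...">. The sample files and the tc_pools and site_pools helpers are hypothetical and exist only to make the sketch self-contained.

```shell
#!/bin/sh
# Sketch: list pools from a sites file that have no "cat" entry in a tc
# file. Assumptions: tc columns start with <pool> <app>, and the sites
# file declares pools as <pool handle="...">. Sample files are made up.
work=$(mktemp -d)

cat > "$work/tc.data" <<'EOF'
# pool       app   path
localhost    cat   /bin/cat
EOF

cat > "$work/sites.xml" <<'EOF'
<config>
  <pool handle="persistent-coasters"/>
</config>
EOF

# Pools that have an entry for application $2 in tc file $1.
tc_pools() {
  awk -v app="$2" '!/^#/ && $2 == app { print $1 }' "$1" | sort -u
}

# Pool handles declared in sites file $1.
site_pools() {
  sed -n 's/.*<pool[^>]*handle="\([^"]*\)".*/\1/p' "$1" | sort -u
}

missing=""
for pool in $(site_pools "$work/sites.xml"); do
  if ! tc_pools "$work/tc.data" cat | grep -qxF -- "$pool"; then
    missing="$missing $pool"
  fi
done
echo "pools without a cat entry:$missing"
```

With the sample files above, the only declared pool has no matching tc entry, which is the kind of mismatch that would make an application unavailable on that pool.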

On Thu, 23 Feb 2012, Emalayan Vairavanathan wrote:

> Hi Justin,
>
> I copied swift-test from ~wozniak/Public/swift-test and tried to run it. I followed the steps below.
>
> - Modified coaster-service.conf
>
> - Started the coaster service.
> - Started swift
>
>
> The tc.data file has entries for cat, but I am getting the error below. Do you have any ideas?
>
>
> Thank you
> Emalayan
>
> emalayan at login2.surveyor:~/swift-test> run.sh 
>
> Swift trunk swift-r5662 (swift modified locally) cog-r3361 (cog modified locally)
>
> RunID: 20120224-0000-rec5hd4a
> Progress:  time: Fri, 24 Feb 2012 00:00:45 +0000
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog 
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> Execution failed:
>     EXCEPTION The application "cat" is not available in the given site/pool in your tc.data catalog
> Caused by: org.globus.cog.karajan.scheduler.NoSuchResourceException
> The application "cat" is not available in the given site/pool in your tc.data catalog 
>
>
>
>
>
> ________________________________
> From: Justin M Wozniak <wozniak at mcs.anl.gov>
> To: Jonathan Monette <jonmon at mcs.anl.gov> 
> Cc: Emalayan Vairavanathan <svemalayan at yahoo.com>; matei <matei at ece.ubc.ca> 
> Sent: Tuesday, 21 February 2012 12:40 PM
> Subject: Re: WORKER_INIT_CMD - with log file
> 
> Hi guys
>
> Yes, that version is based on trunk and is up-to-date with 
> WORKER_INIT_CMD.  The recent bug fix for the BG/P is in there, I just 
> tested it.
>
> I moved the location to ~wozniak/Public/swift .  The test case I used is 
> in ~wozniak/Public/swift-test .  Both should be readable.
>
>     Justin
>
> On Mon, 20 Feb 2012, Jonathan Monette wrote:
>
>> It might....that is a question that Justin can answer.  If it doesn't I am sure the feature can be quickly added.
>>
>> On Feb 20, 2012, at 5:40 PM, Emalayan Vairavanathan <svemalayan at yahoo.com> wrote:
>>
>>> Hi Jon,
>>>
>>> I didn't try the Swift version available in Justin's home directory. I can try it now and let you know.
>>>
>>> But just a quick question: Does this version have WORKER_INIT_CMD?
>>>
>>> Thank you
>>> Emalayan
>>>
>>> From: Jonathan Monette <jonmon at mcs.anl.gov>
>>> To: Emalayan Vairavanathan <svemalayan at yahoo.com>
>>> Cc: Justin Wozniak <wozniak at mcs.anl.gov>; matei <matei at ece.ubc.ca>
>>> Sent: Monday, 20 February 2012 3:35 PM
>>> Subject: Re: WORKER_INIT_CMD - with log file
>>>
>>> I suggest using the Swift version in Justin's directory. This is the stable version for the BG/P. If you are already using it, then let me debug further.
>>>
>>> Which login host were you running on, login1 or login2?
>>>
>>> On Feb 20, 2012, at 3:13 PM, Emalayan Vairavanathan <svemalayan at yahoo.com> wrote:
>>>
>>>>
>>>> Hi Jon and Justin,
>>>>
>>>> I checked out the Swift code from trunk and tried to see whether ModFTDock+Swift works on Surveyor (without MosaStore), but the job did not complete for a long time.
>>>>
>>>> Could you please have a look?
>>>>
>>>> (As far as I remember, last time there was a bug when I tried to launch Swift on Surveyor. Justin fixed the bug and asked me to use the Swift executables from his home directory. Maybe this fix is not available in trunk?)
>>>>
>>>> Thank you
>>>> Emalayan
>>>>
>>>> From: Emalayan Vairavanathan <svemalayan at yahoo.com>
>>>> To: Jonathan Monette <jonmon at mcs.anl.gov>; "emalayan at ece.ubc.ca" <emalayan at ece.ubc.ca>
>>>> Cc: Justin Wozniak <wozniak at mcs.anl.gov>; matei <matei at ece.ubc.ca>; MosaStore <mosastore at googlegroups.com>
>>>> Sent: Monday, 13 February 2012 3:08 PM
>>>> Subject: Re: WORKER_INIT_CMD
>>>>
>>>> Thank you very much Jon. I will ask you if I have questions.
>>>>
>>>> Regards
>>>> Emalayan
>>>>
>>>> From: Jonathan Monette <jonmon at mcs.anl.gov>
>>>> To: emalayan at ece.ubc.ca
>>>> Cc: Justin Wozniak <wozniak at mcs.anl.gov>
>>>> Sent: Sunday, 12 February 2012 4:50 PM
>>>> Subject: WORKER_INIT_CMD
>>>>
>>>> Emalayan,
>>>>    We have now added an environment variable to the worker script.  The variable is called WORKER_INIT_CMD and works like so:
>>>>
>>>> export WORKER_INIT_CMD=<path/to/script.to.run>
>>>>
>>>> The worker will then run this script before entering its main loop, which waits for Swift apps to run.  You have to use manual coasters to use this variable, which I believe you are already doing.
>>>>
>>>> Let me know if you have any questions about this env variable.
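
As a rough illustration of the mechanism described above, the sketch below creates an init script, exports WORKER_INIT_CMD, and then mimics the "run the script before the main loop" step. The script name, its contents, and the stand-in for the worker are all hypothetical; the real worker script is not shown in this thread and may invoke the command differently.

```shell
#!/bin/sh
# Sketch only: worker-init.sh and the "stand-in" below are hypothetical
# and do not reproduce the real Swift worker script.
work=$(mktemp -d)
init="$work/worker-init.sh"

# A trivial init script; a real one might start a per-node service.
cat > "$init" <<EOF
#!/bin/sh
echo "worker init done" > "$work/init.out"
EOF
chmod +x "$init"   # in case the worker executes the path directly

# As in the message: point the variable at the script to run.
export WORKER_INIT_CMD="$init"

# Stand-in for the worker: run the init command, if set, before the
# main loop would start. Run through sh so the sketch does not depend
# on the temporary directory allowing direct execution.
if [ -n "$WORKER_INIT_CMD" ]; then
  sh "$WORKER_INIT_CMD" || echo "init command failed" >&2
fi

cat "$work/init.out"
```

Having the init script leave a marker file, as here, gives a simple way to confirm from the logs or the filesystem that it actually ran on a node.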
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> <modftdock-log.tar.gz>
>>>
>>>
>>
>
> -- 
> Justin M Wozniak

-- 
Justin M Wozniak