[Swift-user] Montage+Swift+Coasters

Ketan Maheshwari ketancmaheshwari at gmail.com
Mon Jan 23 16:28:46 CST 2012


On Mon, Jan 23, 2012 at 2:25 PM, Emalayan Vairavanathan <
svemalayan at yahoo.com> wrote:

> I am using swift-0.93. I started only the coaster-service manually, using
> the following command (workers were started automatically).
>

Are you aware that workers will start automatically *only* on the localhost
where the service is running, and not on the remote nodes?
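
Before starting workers by hand on the remote nodes, it may help to confirm
that each node can reach the service at all. Below is a minimal sketch in
Python, not part of Swift. The function name port_reachable is mine, and I am
assuming that workers connect to the -localport value (35753 in the command
quoted below); substitute whatever host and port your workers are given.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Assumption: replace "localhost" with the name of the machine running
    # coaster-service when this is run from a remote node.
    for port in (1984, 35753):
        state = "reachable" if port_reachable("localhost", port) else "NOT reachable"
        print(f"localhost:{port} is {state}")
```

Run it on every node that is supposed to host workers. A "NOT reachable"
result only shows that the TCP connection failed; it does not say why.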


>
> coaster-service -port 1984 -localport 35753 -nosec
>
> The application then prints the following output and terminates. (I have
> attached the log file to this mail. Please discard the previous log file,
> because the system was not configured properly.)
>
> Please let me know if you need more information.
>
> Thank you
> Emalayan
>
>
> ====================================================================================
> Swift 0.93 swift-r5501 (swift modified locally) cog-r3350
>
> RunID: 20120123-1219-zj95uaye
>  (input): found 10 files
> Progress:  time: Mon, 23 Jan 2012 12:19:39 -0800
>
> Find: http://localhost:1984
> Find:  keepalive(120), reconnect - http://localhost:1984
> Progress:  time: Mon, 23 Jan 2012 12:19:41 -0800  Stage in:1  Submitted:9
> Progress:  time: Mon, 23 Jan 2012 12:19:45 -0800  Active:9  Stage out:1
> Progress:  time: Mon, 23 Jan 2012 12:19:46 -0800  Active:6  Stage out:2
> Finished successfully:2
> Progress:  time: Mon, 23 Jan 2012 12:19:47 -0800  Submitted:1  Finished
> successfully:10
> Progress:  time: Mon, 23 Jan 2012 12:19:49 -0800  Active:1  Finished
> successfully:10
> Progress:  time: Mon, 23 Jan 2012 12:19:50 -0800  Submitted:1  Finished
> successfully:12
> Progress:  time: Mon, 23 Jan 2012 12:19:51 -0800  Stage in:12
> Submitted:5  Finished successfully:13
> Progress:  time: Mon, 23 Jan 2012 12:19:52 -0800  Stage in:1  Submitted:5
> Active:9  Stage out:2  Finished successfully:13
> Progress:  time: Mon, 23 Jan 2012 12:19:53 -0800  Active:5  Finished
> successfully:25
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-bf92dd4d-ecf0-490e-ab93-cf7863688950-5,
> fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk
>
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException:
> Job failed with an exit code of 520
> Execution failed:
>     back_list:Table = org.griphyn.vdl.mapping.DataDependentException -
> Closed not derived due to errors in data dependencies
> [emalayan at node090 scripts]$
>
>
>   ------------------------------
> *From:* Ketan Maheshwari <ketancmaheshwari at gmail.com>
> *To:* Emalayan Vairavanathan <svemalayan at yahoo.com>
> *Cc:* Jonathan Monette <jonmon at mcs.anl.gov>; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:55 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> How are you starting the service? Are you starting workers manually? If
> so, could you paste the command lines for both?
>
> On Mon, Jan 23, 2012 at 1:50 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Thanks Ketan and Jon. I tried, but it is still giving an error. I have
> attached the log file.
>
> Thank you
> Emalayan
>
>   ------------------------------
> *From:* Ketan Maheshwari <ketancmaheshwari at gmail.com>
> *To:* Emalayan Vairavanathan <svemalayan at yahoo.com>
> *Cc:* Jonathan Monette <jonmon at mcs.anl.gov>; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:36 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> Likely, /tmp is not readable/writable across the machines. Could you try
> changing workdir to your /home?
>
> On Mon, Jan 23, 2012 at 1:25 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Jon,
>
> Please find the detail below and let me know if you have any questions
> about my setup.
>
> Thank you
> Emalayan
>
> ==========================================================
> site.xml
>
> <config>
> <pool handle="localhost">
>     <execution provider="coaster-persistent" url="http://localhost:1984"
> jobmanager="local:local"/>
>     <profile namespace="globus" key="workerManager">passive</profile>
>
>     <profile namespace="globus" key="workersPerNode">4</profile>
>     <profile namespace="globus" key="maxTime">100000</profile>
>     <profile namespace="globus" key="lowOverAllocation">100</profile>
>     <profile namespace="globus" key="highOverAllocation">100</profile>
>     <profile namespace="globus" key="slots">100</profile>
>     <profile namespace="globus" key="nodeGranularity">1</profile>
>     <profile namespace="globus" key="maxNodes">10</profile>
>     <profile namespace="karajan" key="jobThrottle">25.00</profile>
>     <profile namespace="karajan" key="initialScore">10000</profile>
>     <profile namespace="swift" key="stagingMethod">proxy</profile>
>     <filesystem provider="local"/>
>     <workdirectory>/tmp/swift.workdir</workdirectory>
>   </pool>
> </config>
>
> =======================================================
>
> tc
>
> localhost sh /bin/sh null null null
> localhost cat /bin/cat null null null
> localhost echo /bin/echo null null null
> localhost do_merge /home/emalayan/App/forEmalayan/app/modmerge null null null
> localhost mProjExec /home/emalayan/App/Montage_v3.3/bin/mProjExec null null null
> localhost mImgtbl /home/emalayan/App/Montage_v3.3/bin/mImgtbl null null null
> localhost mAdd /home/emalayan/App/Montage_v3.3/bin/mAdd null null null
> localhost mOverlaps /home/emalayan/App/Montage_v3.3/bin/mOverlaps null null null
> localhost mJPEG /home/emalayan/App/Montage_v3.3/bin/mJPEG null null null
> localhost mDiffExec_wrap /home/emalayan/App/Montage_v3.3/bin/mDiffExec null null null
> localhost mFitExec /home/emalayan/App/Montage_v3.3/bin/mFitExec null null null
> localhost mBgModel /home/emalayan/App/Montage_v3.3/bin/mBgModel null null null
> localhost mBgExec /home/emalayan/App/Montage_v3.3/bin/mBgExec null null null
> localhost mConcatFit /home/emalayan/App/Montage_v3.3/bin/mConcatFit null null nul
>
> localhost Background_list /home/emalayan/App/montage-swift/SwiftMontage/apps/Background_list.py null null null
> localhost create_status_table /home/emalayan/App/montage-swift/SwiftMontage/apps/create_status_table.py null null null
> localhost mProjectPP_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProjectPP_wrap.py null null null
> localhost mProject_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProject_wrap.py null null null
> localhost mBackground_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mBackground_wrap.py null null null
> localhost mDiffFit_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mDiffFit_wrap.py null null null
>
> =================================================================
>
> cf
>
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=1
> lazy.errors=true
> status.mode=provider
> use.provider.staging=true
> provider.staging.pin.swiftfiles=false
> foreach.max.threads=100
> provenance.log=false
>
> ===================================================================
>
>   ------------------------------
> *From:* Jonathan Monette <jonmon at mcs.anl.gov>
> *To:* Ketan Maheshwari <ketancmaheshwari at gmail.com>
> *Cc:* Emalayan Vairavanathan <svemalayan at yahoo.com>; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:08 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>    So I have run the scripts with some of my own test cases and do not see
> it failing.  Could you provide your config files?  Please provide the tc,
> sites, and config file (if you use a config file).
>
> On Jan 20, 2012, at 9:39 AM, Ketan Maheshwari wrote:
>
> Emalayan,
>
> I would check all the mappers and the resulting paths in the Swift source.
>
> Also try running the failed job by hand, something like this:
>
> cd <swift.workdir>/SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
> mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir
>
> Error 520 indicates that the workers are not able to reach the data.
>
> Also check whether swift.workdir is writable on the site by the worker nodes.
>
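
One way to run the writability check above on a worker node is sketched here
in Python. It is not part of Swift; the function name dir_is_writable and the
probe file prefix are mine, and the default path is simply the
/tmp/swift.workdir value from the sites file quoted earlier in this mail.

```python
import os
import tempfile


def dir_is_writable(path: str) -> bool:
    """Return True if a file can be created, written and removed in path."""
    if not os.path.isdir(path):
        return False
    try:
        # The probe file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path, prefix="swift-probe-") as probe:
            probe.write(b"ok")
            probe.flush()
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: this is the <workdirectory> value from the sites file.
    workdir = "/tmp/swift.workdir"
    print(f"{workdir} writable on {os.uname().nodename}: {dir_is_writable(workdir)}")
```

Note that this only tests the node it runs on. A directory such as /tmp can
be writable on every node and still not be shared between them, so a file
written on one machine would not be visible on another.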
> On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Hi Ketan,
>
> This was with swift-0.92.1. I have now downloaded the latest Swift 0.93
> and am getting totally different error messages with it. I can ask
> Jon about these messages. (These scripts were working well with Swift alone.)
>
> Please let me know if you have any idea.
>
> Regards
> Emalayan
>
>
> ===============================================================================================
> Swift 0.93 swift-r5501 cog-r3350
>
> RunID: 20120119-1749-rjshh1r9
>  (input): found 10 files
> Progress:  time: Thu, 19 Jan 2012 17:49:20 -0800
> Find: http://localhost:1984
> Find:  keepalive(120), reconnect - http://localhost:1984
> Progress:  time: Thu, 19 Jan 2012 17:49:22 -0800  Stage in:1  Submitted:9
> Progress:  time: Thu, 19 Jan 2012 17:49:25 -0800  Active:9  Stage out:1
> Progress:  time: Thu, 19 Jan 2012 17:49:26 -0800  Stage out:3  Finished
> successfully:7
> Progress:  time: Thu, 19 Jan 2012 17:49:28 -0800  Active:1  Finished
> successfully:10
> Progress:  time: Thu, 19 Jan 2012 17:49:29 -0800  Stage in:1
> Submitting:11  Submitted:6  Finished successfully:12
> Progress:  time: Thu, 19 Jan 2012 17:49:30 -0800  Stage in:4  Submitted:1
> Active:6  Stage out:2  Finished successfully:17
> Progress:  time: Thu, 19 Jan 2012 17:49:31 -0800  Active:1  Finished
> successfully:30
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5,
> fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException:
> Job failed with an exit code of 520
> Execution failed:
>     back_list:Table = org.griphyn.vdl.mapping.DataDependentException -
> Closed not derived due to errors in data dependencies
>
>    ------------------------------
> *From:* Ketan Maheshwari <ketancmaheshwari at gmail.com>
> *To:* Emalayan Vairavanathan <svemalayan at yahoo.com>
> *Cc:* swift user <swift-user at ci.uchicago.edu>
> *Sent:* Thursday, 19 January 2012 4:49 PM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> From your symptoms, it seems you are facing the same issue as I have been.
> Could you tell me more about the amount of data that needs to be staged to run
> the Montage stages during which these warnings turn up? How much time
> elapses from the start of your workflow until you see these messages?
>
> Also, what version of Swift is this?
>
> Regards,
> Ketan
>
> On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Dear All,
>
>  I have a problem running Montage with Coasters (on our local cluster,
> with no batch schedulers). After a few stages the Swift runtime continuously
> prints the warnings below. Any ideas? Should I increase the heartbeat
> count?
>
> Everything works fine when I try to run the same montage-scripts with
> swift on a single machine.
>
> Thank you
> Emalayan
>
>
>  2012-01-19 15:38:09,207-0800 WARN  Command Command(119, HEARTBEAT):
> handling reply timeout; sendReqTime=120119-153609.206,
> sendTime=120119-153609.206, now=120119-153809.207
> 2012-01-19 15:38:09,207-0800 INFO  Command Command(119, HEARTBEAT):
> re-sending
> 2012-01-19 15:38:09,209-0800 WARN  Command Command(119, HEARTBEAT)fault
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>         at
> org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>         at
> org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>         at java.util.TimerThread.mainLoop(Timer.java:534)
>         at java.util.TimerThread.run(Timer.java:484)
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user


-- 
Ketan