[Swift-user] Montage+Swift+Coasters
Ketan Maheshwari
ketancmaheshwari at gmail.com
Mon Jan 23 13:36:08 CST 2012
Emalayan,
Most likely /tmp is not readable and writable across the machines, since it
is usually local to each node. Could you try changing the workdir to your /home?
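
A rough way to check a candidate workdir before rerunning is sketched below. It only tests that the directory can be created and written to on the machine where it runs, so it would need to be run on the submit host and on each worker node. check_workdir is a hypothetical helper for illustration, not part of Swift, and the marker file name is made up.

```shell
#!/bin/sh
# Sketch only: check_workdir is a hypothetical helper, not a Swift command.
# It reports whether a directory can be created and written to locally.
check_workdir() {
    dir=$1
    # Create the directory if it does not exist yet.
    mkdir -p "$dir" 2>/dev/null || { echo "cannot create: $dir"; return 1; }
    # Hypothetical marker file name; $$ keeps it unique per shell process.
    marker="$dir/.swift-write-test.$$"
    # Try to write the marker in a subshell so redirection errors stay quiet.
    if ( echo test > "$marker" ) 2>/dev/null; then
        rm -f "$marker"
        echo "writable: $dir"
    else
        echo "not writable: $dir"
        return 1
    fi
}
```

For example, check_workdir "$HOME/swift.workdir" (an assumed path) could be run on every node. If it reports the directory as writable everywhere, the <workdirectory> entry in site.xml could then point to that path instead of /tmp/swift.workdir.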
On Mon, Jan 23, 2012 at 1:25 PM, Emalayan Vairavanathan <
svemalayan at yahoo.com> wrote:
> Jon,
>
> Please find the detail below and let me know if you have any questions
> about my setup.
>
> Thank you
> Emalayan
>
> ==========================================================
> site.xml
>
> <config>
> <pool handle="localhost">
> <execution provider="coaster-persistent" url="http://localhost:1984"
> jobmanager="local:local"/>
> <profile namespace="globus" key="workerManager">passive</profile>
>
> <profile namespace="globus" key="workersPerNode">4</profile>
> <profile namespace="globus" key="maxTime">100000</profile>
> <profile namespace="globus" key="lowOverAllocation">100</profile>
> <profile namespace="globus" key="highOverAllocation">100</profile>
> <profile namespace="globus" key="slots">100</profile>
> <profile namespace="globus" key="nodeGranularity">1</profile>
> <profile namespace="globus" key="maxNodes">10</profile>
> <profile namespace="karajan" key="jobThrottle">25.00</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <profile namespace="swift" key="stagingMethod">proxy</profile>
> <filesystem provider="local"/>
> <workdirectory>/tmp/swift.workdir</workdirectory>
> </pool>
> </config>
>
> =======================================================
>
> tc
>
> localhost sh /bin/sh null null null
> localhost cat /bin/cat null null null
> localhost echo /bin/echo null null null
> localhost do_merge /home/emalayan/App/forEmalayan/app/modmerge null null
> null
> localhost mProjExec /home/emalayan/App/Montage_v3.3/bin/mProjExec null
> null null
> localhost mImgtbl /home/emalayan/App/Montage_v3.3/bin/mImgtbl null null
> null
> localhost mAdd /home/emalayan/App/Montage_v3.3/bin/mAdd null null null
> localhost mOverlaps /home/emalayan/App/Montage_v3.3/bin/mOverlaps null
> null null
> localhost mJPEG /home/emalayan/App/Montage_v3.3/bin/mJPEG null null null
> localhost mDiffExec_wrap /home/emalayan/App/Montage_v3.3/bin/mDiffExec
> null null null
> localhost mFitExec /home/emalayan/App/Montage_v3.3/bin/mFitExec null null
> null
> localhost mBgModel /home/emalayan/App/Montage_v3.3/bin/mBgModel null null
> null
> localhost mBgExec /home/emalayan/App/Montage_v3.3/bin/mBgExec null null
> null
> localhost mConcatFit /home/emalayan/App/Montage_v3.3/bin/mConcatFit null
> null null
>
> localhost Background_list
> /home/emalayan/App/montage-swift/SwiftMontage/apps/Background_list.py null
> null null
> localhost create_status_table
> /home/emalayan/App/montage-swift/SwiftMontage/apps/create_status_table.py
> null null null
> localhost mProjectPP_wrap
> /home/emalayan/App/montage-swift/SwiftMontage/apps/mProjectPP_wrap.py null
> null null
> localhost mProject_wrap
> /home/emalayan/App/montage-swift/SwiftMontage/apps/mProject_wrap.py null
> null null
> localhost mBackground_wrap
> /home/emalayan/App/montage-swift/SwiftMontage/apps/mBackground_wrap.py null
> null null
> localhost mDiffFit_wrap
> /home/emalayan/App/montage-swift/SwiftMontage/apps/mDiffFit_wrap.py null
> null null
>
> =================================================================
>
> cf
>
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=1
> lazy.errors=true
> status.mode=provider
> use.provider.staging=true
> provider.staging.pin.swiftfiles=false
> foreach.max.threads=100
> provenance.log=false
>
> ===================================================================
>
> ------------------------------
> *From:* Jonathan Monette <jonmon at mcs.anl.gov>
> *To:* Ketan Maheshwari <ketancmaheshwari at gmail.com>
> *Cc:* Emalayan Vairavanathan <svemalayan at yahoo.com>; swift user <
> swift-user at ci.uchicago.edu>
> *Sent:* Monday, 23 January 2012 11:08 AM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
> I have run the scripts with some of my own test cases and do not see
> them fail. Could you provide your config files? Please provide the tc,
> sites, and config files (if you use a config file).
>
> On Jan 20, 2012, at 9:39 AM, Ketan Maheshwari wrote:
>
> Emalayan,
>
> I would check all the mappers and the resulting paths in the Swift source.
>
> Also try running the failed job something like this:
>
> cd <swift.workdir>/SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
> mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir
>
> Error 520 indicates that the workers are not able to reach the data.
>
> Also check if swift.workdir is writable on the site by the worker nodes.
>
> On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Hi Ketan,
>
> This was with swift-0.92.1. I have now downloaded the latest Swift 0.93
> and am getting totally different error messages with it. I can ask
> Jon about these messages. (These scripts were working well with only Swift.)
>
> Please let me know if you have any idea.
>
> Regards
> Emalayan
>
>
> ===============================================================================================
> Swift 0.93 swift-r5501 cog-r3350
>
> RunID: 20120119-1749-rjshh1r9
> (input): found 10 files
> Progress: time: Thu, 19 Jan 2012 17:49:20 -0800
> Find: http://localhost:1984
> Find: keepalive(120), reconnect - http://localhost:1984
> Progress: time: Thu, 19 Jan 2012 17:49:22 -0800 Stage in:1 Submitted:9
> Progress: time: Thu, 19 Jan 2012 17:49:25 -0800 Active:9 Stage out:1
> Progress: time: Thu, 19 Jan 2012 17:49:26 -0800 Stage out:3 Finished
> successfully:7
> Progress: time: Thu, 19 Jan 2012 17:49:28 -0800 Active:1 Finished
> successfully:10
> Progress: time: Thu, 19 Jan 2012 17:49:29 -0800 Stage in:1
> Submitting:11 Submitted:6 Finished successfully:12
> Progress: time: Thu, 19 Jan 2012 17:49:30 -0800 Stage in:4 Submitted:1
> Active:6 Stage out:2 Finished successfully:17
> Progress: time: Thu, 19 Jan 2012 17:49:31 -0800 Active:1 Finished
> successfully:30
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5,
> fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException:
> Job failed with an exit code of 520
> Execution failed:
> back_list:Table = org.griphyn.vdl.mapping.DataDependentException -
> Closed not derived due to errors in data dependencies
>
> ------------------------------
> *From:* Ketan Maheshwari <ketancmaheshwari at gmail.com>
> *To:* Emalayan Vairavanathan <svemalayan at yahoo.com>
> *Cc:* swift user <swift-user at ci.uchicago.edu>
> *Sent:* Thursday, 19 January 2012 4:49 PM
> *Subject:* Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> From your symptoms, it seems you are facing the same issue that I have been.
> Could you tell us more about the amount of data that needs to be staged to run
> the Montage stages during which these warnings turn up? How much time
> elapses between the start of your workflow and the first of these messages?
>
> Also, what version of Swift is this?
>
> Regards,
> Ketan
>
> On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan <
> svemalayan at yahoo.com> wrote:
>
> Dear All,
>
> I have a problem running Montage with Coasters (on our local cluster,
> with no batch schedulers). After a few stages the Swift run-time continuously
> prints the warnings below. Any ideas? Should I increase the heartbeat
> count?
>
> Everything works fine when I try to run the same montage-scripts with
> swift on a single machine.
>
> Thank you
> Emalayan
>
>
> 2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT):
> handling reply timeout; sendReqTime=120119-153609.206,
> sendTime=120119-153609.206, now=120119-153809.207
> 2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT):
> re-sending
> 2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault
> was: Reply timeout
> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
> at java.util.TimerThread.mainLoop(Timer.java:534)
> at java.util.TimerThread.run(Timer.java:484)
>
> _______________________________________________
> Swift-user mailing list
> Swift-user at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>
>
>
>
> --
> Ketan
--
Ketan