[Swift-user] Montage+Swift+Coasters
Jonathan Monette
jonmon at mcs.anl.gov
Tue Jan 24 10:58:14 CST 2012
Ketan has already given most of the tips I would have offered.
Two things. First, can you set lazy.errors=false in the cf file? This may give us a different error, since the script will fail immediately instead of trying to continue.
Second, I have not tried these scripts with provider staging turned on, and that may be what is causing the data problem. Try the change above first to see if we get a different error or at least better information.
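
A minimal sketch of the lazy.errors change, in Python. It assumes the cf file is plain key=value text like the one quoted further down in this thread; the helper name set_property and the sample text are illustrative and not part of Swift.

```python
def set_property(cf_text: str, key: str, value: str) -> str:
    """Return cf_text with `key` set to `value`, appending the key if absent."""
    lines = cf_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave blank lines and comments untouched.
        if not stripped or stripped.startswith("#"):
            continue
        name, sep, _ = stripped.partition("=")
        # Match on the exact key name only, not on a prefix.
        if sep and name.strip() == key:
            lines[i] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Sample text: three of the settings from the cf file quoted in this thread.
cf = "execution.retries=1\nlazy.errors=true\nuse.provider.staging=true\n"
updated = set_property(cf, "lazy.errors", "false")
print(updated, end="")
```

The same helper could be used to flip use.provider.staging if the first change does not produce a more useful error.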
On Jan 23, 2012, at 9:55 PM, Emalayan Vairavanathan wrote:
> Hi Ketan, I tried, but I am getting the same error.
>
> From: Ketan Maheshwari <ketancmaheshwari at gmail.com>
> To: Emalayan Vairavanathan <svemalayan at yahoo.com>
> Cc: Jonathan Monette <jonmon at mcs.anl.gov>; swift user <swift-user at ci.uchicago.edu>
> Sent: Monday, 23 January 2012 7:41 PM
> Subject: Re: [Swift-user] Montage+Swift+Coasters
>
>
>
> On Mon, Jan 23, 2012 at 9:21 PM, Emalayan Vairavanathan <svemalayan at yahoo.com> wrote:
> Hi Ketan,
>
> Please find the attached source code. Also I couldn't find SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk directory inside workdir.
>
> Try again after setting this to false in your config:
> wrapperlog.always.transfer=true
>
>
> Please let me know if you need more information
>
> Thank you
> Emalayan
>
> From: Ketan Maheshwari <ketancmaheshwari at gmail.com>
> To: Emalayan Vairavanathan <svemalayan at yahoo.com>
> Cc: Jonathan Monette <jonmon at mcs.anl.gov>; swift user <swift-user at ci.uchicago.edu>
> Sent: Monday, 23 January 2012 6:38 PM
> Subject: Re: [Swift-user] Montage+Swift+Coasters
>
>
>
> On Mon, Jan 23, 2012 at 4:52 PM, Emalayan Vairavanathan <svemalayan at yahoo.com> wrote:
> Hi Ketan,
>
> Please find my answers below.
>
> [Ketan] Emalayan, Could you also send your swift source.
> [Emalayan] Did you mean the Montage Swift scripts or the swift-0.93 source code?
>
> Montage script
>
>
>
> [Ketan] Have you tried running mConcatFit from within the SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk directory?
> [Emalayan] No such directory was created.
>
> It should be in your workdir.
>
>
>
> [Ketan] Are you aware that workers start automatically *only* on the localhost where the service is running, and not on the remote nodes?
> [Emalayan] Yes, I am aware of this. I ran both the coaster-service and the application scripts on the same node, but I would like to know how to set up workers on other nodes too.
>
> You may run worker.pl manually or, better, put it in a for loop in a simple shell script to run multiple workers. The command line is something like: worker.pl <serviceip:port> label /path/to/log
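
The loop described above could look roughly like the following, sketched in Python rather than shell. It only builds and prints the command lines, using the argument order given in the message; SERVICE_IP:PORT and /path/to/log are placeholders, and the worker0, worker1 labels are made up for illustration.

```python
import shlex


def worker_commands(service, count, log_path, worker="worker.pl"):
    """Build one worker.pl command line per worker: <service> <label> <log path>."""
    commands = []
    for i in range(count):
        label = f"worker{i}"  # hypothetical label scheme
        commands.append([worker, service, label, log_path])
    return commands


# Placeholders only: substitute the real service address and log location.
for cmd in worker_commands("SERVICE_IP:PORT", 3, "/path/to/log"):
    print(shlex.join(cmd))
```

To actually start the workers, each list could be passed to subprocess.Popen on the node where that worker should run.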
>
>
> Thank you
> Emalayan
>
> From: Ketan Maheshwari <ketancmaheshwari at gmail.com>
> To: Emalayan Vairavanathan <svemalayan at yahoo.com>
> Cc: Jonathan Monette <jonmon at mcs.anl.gov>; swift user <swift-user at ci.uchicago.edu>
> Sent: Monday, 23 January 2012 12:57 PM
> Subject: Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan, Could you also send your swift source.
>
> Have you tried running mConcatFit from within the SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk directory?
>
> On Mon, Jan 23, 2012 at 2:25 PM, Emalayan Vairavanathan <svemalayan at yahoo.com> wrote:
> I am using swift-0.93. I started only the coaster-service manually, using the following command (the workers were started automatically).
>
> coaster-service -port 1984 -localport 35753 -nosec
>
> The application then prints the following output and terminates. (I have attached the log file to this mail. Please discard the previous log file, because the system was not configured properly.)
>
> Please let me know if you need more information.
>
> Thank you
> Emalayan
>
> ====================================================================================
> Swift 0.93 swift-r5501 (swift modified locally) cog-r3350
>
> RunID: 20120123-1219-zj95uaye
> (input): found 10 files
> Progress: time: Mon, 23 Jan 2012 12:19:39 -0800
>
> Find: http://localhost:1984
> Find: keepalive(120), reconnect - http://localhost:1984
> Progress: time: Mon, 23 Jan 2012 12:19:41 -0800 Stage in:1 Submitted:9
> Progress: time: Mon, 23 Jan 2012 12:19:45 -0800 Active:9 Stage out:1
> Progress: time: Mon, 23 Jan 2012 12:19:46 -0800 Active:6 Stage out:2 Finished successfully:2
> Progress: time: Mon, 23 Jan 2012 12:19:47 -0800 Submitted:1 Finished successfully:10
> Progress: time: Mon, 23 Jan 2012 12:19:49 -0800 Active:1 Finished successfully:10
> Progress: time: Mon, 23 Jan 2012 12:19:50 -0800 Submitted:1 Finished successfully:12
> Progress: time: Mon, 23 Jan 2012 12:19:51 -0800 Stage in:12 Submitted:5 Finished successfully:13
> Progress: time: Mon, 23 Jan 2012 12:19:52 -0800 Stage in:1 Submitted:5 Active:9 Stage out:2 Finished successfully:13
> Progress: time: Mon, 23 Jan 2012 12:19:53 -0800 Active:5 Finished successfully:25
> Exception in mConcatFit:
> Arguments: [_concurrent/status_tbl-bf92dd4d-ecf0-490e-ab93-cf7863688950-5, fits.tbl, stat_dir]
> Host: localhost
> Directory: SwiftMontage-20120123-1219-zj95uaye/jobs/4/mConcatFit-4o2fb2mk
>
> - - -
>
> Caused by: null
> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
> Execution failed:
> back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
> [emalayan at node090 scripts]$
>
>
> From: Ketan Maheshwari <ketancmaheshwari at gmail.com>
> To: Emalayan Vairavanathan <svemalayan at yahoo.com>
> Cc: Jonathan Monette <jonmon at mcs.anl.gov>; swift user <swift-user at ci.uchicago.edu>
> Sent: Monday, 23 January 2012 11:55 AM
> Subject: Re: [Swift-user] Montage+Swift+Coasters
>
> How are you starting the service? Are you starting workers manually? If yes, could you paste the command lines for both?
>
> On Mon, Jan 23, 2012 at 1:50 PM, Emalayan Vairavanathan <svemalayan at yahoo.com> wrote:
> Thanks Ketan and Jon. I tried, but it is still giving an error. I have attached the log file.
>
> Thank you
> Emalayan
>
> From: Ketan Maheshwari <ketancmaheshwari at gmail.com>
> To: Emalayan Vairavanathan <svemalayan at yahoo.com>
> Cc: Jonathan Monette <jonmon at mcs.anl.gov>; swift user <swift-user at ci.uchicago.edu>
> Sent: Monday, 23 January 2012 11:36 AM
> Subject: Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
>
> Likely, /tmp is not readable/writable across the machines. Could you try changing workdir to your /home?
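
One way to test a candidate workdir, as a small Python sketch: it tries to create and write a temporary file in the directory. This only checks the machine it runs on, so it would need to be run on every node that hosts a worker. The function name dir_is_writable is illustrative.

```python
import os
import tempfile


def dir_is_writable(path: str) -> bool:
    """True if a file can actually be created and written inside `path`."""
    if not os.path.isdir(path):
        return False
    try:
        # The probe file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"swift workdir probe")
            probe.flush()
        return True
    except OSError:
        return False


# Example: check the system temp directory on this machine.
print(dir_is_writable(tempfile.gettempdir()))
```

A directory that passes on every node can still be node-local rather than shared, so a file written on one node should also be looked up from another.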
>
> On Mon, Jan 23, 2012 at 1:25 PM, Emalayan Vairavanathan <svemalayan at yahoo.com> wrote:
> Jon,
>
> Please find the details below and let me know if you have any questions about my setup.
>
> Thank you
> Emalayan
>
> ==========================================================
> site.xml
>
> <config>
> <pool handle="localhost">
> <execution provider="coaster-persistent" url="http://localhost:1984" jobmanager="local:local"/>
> <profile namespace="globus" key="workerManager">passive</profile>
>
> <profile namespace="globus" key="workersPerNode">4</profile>
> <profile namespace="globus" key="maxTime">100000</profile>
> <profile namespace="globus" key="lowOverAllocation">100</profile>
> <profile namespace="globus" key="highOverAllocation">100</profile>
> <profile namespace="globus" key="slots">100</profile>
> <profile namespace="globus" key="nodeGranularity">1</profile>
> <profile namespace="globus" key="maxNodes">10</profile>
> <profile namespace="karajan" key="jobThrottle">25.00</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <profile namespace="swift" key="stagingMethod">proxy</profile>
> <filesystem provider="local"/>
> <workdirectory>/tmp/swift.workdir</workdirectory>
> </pool>
> </config>
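
For a quick look at what a sites file like this one configures, here is a sketch using Python's standard xml.etree.ElementTree. The embedded XML is a trimmed copy of the file above, and summarize_pools is a hypothetical helper, not a Swift tool.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sites file quoted in this thread.
SITES = """<config>
  <pool handle="localhost">
    <execution provider="coaster-persistent" url="http://localhost:1984" jobmanager="local:local"/>
    <profile namespace="globus" key="workerManager">passive</profile>
    <profile namespace="swift" key="stagingMethod">proxy</profile>
    <filesystem provider="local"/>
    <workdirectory>/tmp/swift.workdir</workdirectory>
  </pool>
</config>"""


def summarize_pools(xml_text):
    """Map each pool handle to its provider, workdirectory, and profile keys."""
    root = ET.fromstring(xml_text)
    summary = {}
    for pool in root.findall("pool"):
        execution = pool.find("execution")
        summary[pool.get("handle")] = {
            "provider": execution.get("provider") if execution is not None else None,
            "workdirectory": pool.findtext("workdirectory"),
            "profiles": {p.get("key"): p.text for p in pool.findall("profile")},
        }
    return summary


for handle, info in summarize_pools(SITES).items():
    print(handle, info["provider"], info["workdirectory"])
```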
>
> =======================================================
>
> tc
>
> localhost sh /bin/sh null null null
> localhost cat /bin/cat null null null
> localhost echo /bin/echo null null null
> localhost do_merge /home/emalayan/App/forEmalayan/app/modmerge null null null
> localhost mProjExec /home/emalayan/App/Montage_v3.3/bin/mProjExec null null null
> localhost mImgtbl /home/emalayan/App/Montage_v3.3/bin/mImgtbl null null null
> localhost mAdd /home/emalayan/App/Montage_v3.3/bin/mAdd null null null
> localhost mOverlaps /home/emalayan/App/Montage_v3.3/bin/mOverlaps null null null
> localhost mJPEG /home/emalayan/App/Montage_v3.3/bin/mJPEG null null null
> localhost mDiffExec_wrap /home/emalayan/App/Montage_v3.3/bin/mDiffExec null null null
> localhost mFitExec /home/emalayan/App/Montage_v3.3/bin/mFitExec null null null
> localhost mBgModel /home/emalayan/App/Montage_v3.3/bin/mBgModel null null null
> localhost mBgExec /home/emalayan/App/Montage_v3.3/bin/mBgExec null null null
> localhost mConcatFit /home/emalayan/App/Montage_v3.3/bin/mConcatFit null null nul
>
> localhost Background_list /home/emalayan/App/montage-swift/SwiftMontage/apps/Background_list.py null null null
> localhost create_status_table /home/emalayan/App/montage-swift/SwiftMontage/apps/create_status_table.py null null null
> localhost mProjectPP_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProjectPP_wrap.py null null null
> localhost mProject_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mProject_wrap.py null null null
> localhost mBackground_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mBackground_wrap.py null null null
> localhost mDiffFit_wrap /home/emalayan/App/montage-swift/SwiftMontage/apps/mDiffFit_wrap.py null null null
>
> =================================================================
>
> cf
>
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=1
> lazy.errors=true
> status.mode=provider
> use.provider.staging=true
> provider.staging.pin.swiftfiles=false
> foreach.max.threads=100
> provenance.log=false
>
> ===================================================================
>
> From: Jonathan Monette <jonmon at mcs.anl.gov>
> To: Ketan Maheshwari <ketancmaheshwari at gmail.com>
> Cc: Emalayan Vairavanathan <svemalayan at yahoo.com>; swift user <swift-user at ci.uchicago.edu>
> Sent: Monday, 23 January 2012 11:08 AM
> Subject: Re: [Swift-user] Montage+Swift+Coasters
>
> Emalayan,
> I have run the scripts with some of my own test cases and do not see them failing. Could you provide your config files? Please provide the tc, sites, and config files (if you use a config file).
>
> On Jan 20, 2012, at 9:39 AM, Ketan Maheshwari wrote:
>
>> Emalayan,
>>
>> I would check all the mappers and the resulting paths in the Swift source.
>>
>> Also try running the failed job by hand, something like this:
>>
>> cd <swift.workdir>/SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>>
>> mConcatFit _concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5 fits.tbl stat_dir
>>
>> Error 520 indicates that the workers are not able to reach the data.
>>
>> Also check if swift.workdir is writable on the site by the worker nodes.
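
The manual re-run suggested above can be derived from the exception text itself. The Python sketch below pulls the app name, arguments, and job directory out of the "Exception in ..." block and prints the two commands; rerun_recipe is a hypothetical helper, and /tmp/swift.workdir is the workdirectory from the sites file posted in this thread.

```python
import os
import re
import shlex

# Exception text copied from the Swift output quoted in this thread.
EXCEPTION_TEXT = """Exception in mConcatFit:
Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5, fits.tbl, stat_dir]
Host: localhost
Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
"""


def rerun_recipe(exception_text, workdir):
    """Return (job_directory, argv) for re-running a failed app by hand."""
    app = re.search(r"Exception in (\S+):", exception_text).group(1)
    raw_args = re.search(r"Arguments: \[(.*?)\]", exception_text).group(1)
    job_dir = re.search(r"Directory: (\S+)", exception_text).group(1)
    # Assumes no argument contains a comma, which holds for this log.
    argv = [app] + [a.strip() for a in raw_args.split(",") if a.strip()]
    return os.path.join(workdir, job_dir), argv


job_dir, argv = rerun_recipe(EXCEPTION_TEXT, "/tmp/swift.workdir")
print("cd " + shlex.quote(job_dir))
print(shlex.join(argv))
```

If the job directory does not exist under the workdir, that is itself a useful clue about where staging went wrong.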
>>
>> On Thu, Jan 19, 2012 at 7:55 PM, Emalayan Vairavanathan <svemalayan at yahoo.com> wrote:
>> Hi Ketan,
>>
>> This was with swift-0.92.1. I have now downloaded the latest Swift 0.93 and am getting totally different error messages with it. I can ask Jon about these messages. (These scripts were working well with Swift alone.)
>>
>> Please let me know if you have any idea.
>>
>> Regards
>> Emalayan
>>
>> ===============================================================================================
>> Swift 0.93 swift-r5501 cog-r3350
>>
>> RunID: 20120119-1749-rjshh1r9
>> (input): found 10 files
>> Progress: time: Thu, 19 Jan 2012 17:49:20 -0800
>> Find: http://localhost:1984
>> Find: keepalive(120), reconnect - http://localhost:1984
>> Progress: time: Thu, 19 Jan 2012 17:49:22 -0800 Stage in:1 Submitted:9
>> Progress: time: Thu, 19 Jan 2012 17:49:25 -0800 Active:9 Stage out:1
>> Progress: time: Thu, 19 Jan 2012 17:49:26 -0800 Stage out:3 Finished successfully:7
>> Progress: time: Thu, 19 Jan 2012 17:49:28 -0800 Active:1 Finished successfully:10
>> Progress: time: Thu, 19 Jan 2012 17:49:29 -0800 Stage in:1 Submitting:11 Submitted:6 Finished successfully:12
>> Progress: time: Thu, 19 Jan 2012 17:49:30 -0800 Stage in:4 Submitted:1 Active:6 Stage out:2 Finished successfully:17
>> Progress: time: Thu, 19 Jan 2012 17:49:31 -0800 Active:1 Finished successfully:30
>> Exception in mConcatFit:
>> Arguments: [_concurrent/status_tbl-7a8340c2-045d-4039-a77c-00429b78d9c9-5, fits.tbl, stat_dir]
>> Host: localhost
>> Directory: SwiftMontage-20120119-1749-rjshh1r9/jobs/b/mConcatFit-b1sa4vlk
>> - - -
>>
>> Caused by: null
>> Caused by: org.globus.cog.abstraction.impl.common.execution.JobException: Job failed with an exit code of 520
>> Execution failed:
>> back_list:Table = org.griphyn.vdl.mapping.DataDependentException - Closed not derived due to errors in data dependencies
>>
>> From: Ketan Maheshwari <ketancmaheshwari at gmail.com>
>> To: Emalayan Vairavanathan <svemalayan at yahoo.com>
>> Cc: swift user <swift-user at ci.uchicago.edu>
>> Sent: Thursday, 19 January 2012 4:49 PM
>> Subject: Re: [Swift-user] Montage+Swift+Coasters
>>
>> Emalayan,
>>
>> From your symptoms, it seems you are facing the same issue I have been seeing. Could you tell me more about the amount of data that needs to be staged for the Montage stages during which these warnings turn up? How much time elapses between the start of your workflow and the first of these messages?
>>
>> Also, what version of Swift is this?
>>
>> Regards,
>> Ketan
>>
>> On Thu, Jan 19, 2012 at 5:51 PM, Emalayan Vairavanathan <svemalayan at yahoo.com> wrote:
>> Dear All,
>>
>> I have a problem running Montage with Coasters (on our local cluster, with no batch scheduler). After a few stages the Swift run-time continuously prints the warnings below. Any ideas? Should I increase the heartbeat count?
>>
>> Everything works fine when I try to run the same montage-scripts with swift on a single machine.
>>
>> Thank you
>> Emalayan
>>
>>
>> 2012-01-19 15:38:09,207-0800 WARN Command Command(119, HEARTBEAT): handling reply timeout; sendReqTime=120119-153609.206, sendTime=120119-153609.206, now=120119-153809.207
>> 2012-01-19 15:38:09,207-0800 INFO Command Command(119, HEARTBEAT): re-sending
>> 2012-01-19 15:38:09,209-0800 WARN Command Command(119, HEARTBEAT)fault was: Reply timeout
>> org.globus.cog.karajan.workflow.service.ReplyTimeoutException
>> at org.globus.cog.karajan.workflow.service.commands.Command.handleReplyTimeout(Command.java:288)
>> at org.globus.cog.karajan.workflow.service.commands.Command$Timeout.run(Command.java:293)
>> at java.util.TimerThread.mainLoop(Timer.java:534)
>> at java.util.TimerThread.run(Timer.java:484)
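
As a worked check of the timing in that warning, the Python sketch below computes the gap between sendTime and now. The yymmdd-HHMMSS.mmm layout is inferred from the log line's own timestamp, not from Swift documentation.

```python
from datetime import datetime

# Layout inferred from the log: two-digit year, month, day, then time with milliseconds.
FMT = "%y%m%d-%H%M%S.%f"


def elapsed_seconds(send_time: str, now: str) -> float:
    """Seconds between two timestamps in the heartbeat warning's format."""
    delta = datetime.strptime(now, FMT) - datetime.strptime(send_time, FMT)
    return delta.total_seconds()


# Values copied from the HEARTBEAT warning above.
print(elapsed_seconds("120119-153609.206", "120119-153809.207"))
```

The gap comes out at about 120 seconds, which lines up with the keepalive(120) shown in the run output elsewhere in this thread. That suggests, though nothing here confirms it, that the heartbeat reply simply never arrived within the configured window.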
>>
>> _______________________________________________
>> Swift-user mailing list
>> Swift-user at ci.uchicago.edu
>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user