[Swift-devel] RE: [Swift-user] Execution error
Michael Wilde
wilde at mcs.anl.gov
Thu Apr 30 16:39:02 CDT 2009
And we should also drill back down to why (at least yesterday) the GT4
softev package failed, but the OSG client worked, for globus-job-run.
I guess it's possible there is a host or CA cert issue here.
- Mike
On 4/30/09 4:31 PM, Mihael Hategan wrote:
> Can you guys try to run first.swift on ranger with the settings you have
> (you'll need to add "echo" to tc.data)?
>
>
> On Thu, 2009-04-30 at 16:13 -0500, Glen Hocky wrote:
>> I have the identical response on ranger. It started yesterday evening.
>> Possibly a problem that the TACC folks need to fix?
>>
>> Glen
>>
>> Yue, Chen - BMD wrote:
>>> Hi Michael,
>>>
>>> Thank you for the advice. I tested ranger with 1 job and the new
>>> maxwalltime settings. It shows the following error message. I
>>> don't know if there is another problem with my setup. Thank you!
>>>
>>> /////////////////////////////////////////////////
>>> [yuechen at communicado PTMap2]$ swift PTMap2-unmod.swift -sites.file
>>> sites.xml -tc.file tc.data
>>> Swift 0.9rc2 swift-r2860 cog-r2388
>>> RunID: 20090430-1559-2vi6x811
>>> Progress:
>>> Progress: Stage in:1
>>> Progress: Submitting:1
>>> Progress: Submitting:1
>>> Progress: Submitted:1
>>> Progress: Active:1
>>> Failed to transfer wrapper log from
>>> PTMap2-unmod-20090430-1559-2vi6x811/info/i on ranger
>>> Progress: Active:1
>>> Failed to transfer wrapper log from
>>> PTMap2-unmod-20090430-1559-2vi6x811/info/k on ranger
>>> Progress: Stage in:1
>>> Progress: Active:1
>>> Failed to transfer wrapper log from
>>> PTMap2-unmod-20090430-1559-2vi6x811/info/m on ranger
>>> Progress: Failed:1
>>> Execution failed:
>>> Exception in PTMap2:
>>> Arguments: [e04.mzXML, ./seqs-ecolik12/fasta01, inputs-unmod.txt,
>>> parameters.txt]
>>> Host: ranger
>>> Directory: PTMap2-unmod-20090430-1559-2vi6x811/jobs/m/PTMap2-mbe6m5aj
>>> stderr.txt:
>>> stdout.txt:
>>> ----
>>> Caused by:
>>> Failed to start worker:
>>> null
>>> null
>>> org.globus.gram.GramException: The job manager detected an invalid
>>> script response
>>> at
>>> org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:530)
>>> at org.globus.gram.GramJob.setStatus(GramJob.java:184)
>>> at
>>> org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:176)
>>> at java.lang.Thread.run(Thread.java:619)
>>> Cleaning up...
>>> Shutting down service at https://129.114.50.163:45562
>>> Got channel MetaChannel: 20903429 -> GSSSChannel-null(1)
>>> - Done
>>> [yuechen at communicado PTMap2]$
>>> ///////////////////////////////////////////////////////////
>>>
>>> Chen, Yue
>>>
>>>
>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov]
>>> *Sent:* Thu 4/30/2009 3:02 PM
>>> *To:* Yue, Chen - BMD; swift-devel
>>> *Subject:* Re: [Swift-user] Execution error
>>>
>>> Back on list here (I only went off-list to discuss accounts, etc)
>>>
>>> The problem in the run below is this:
>>>
>>> 2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION
>>> jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with
>>> the given max walltime worker constraint (task: 3000, \
>>> maxwalltime: 2400s)
>>>
>>> You have this on the ptmap app in your tc.data:
>>>
>>> globus::maxwalltime=50
>>>
>>> But you only gave coasters 40 mins per coaster worker. So it's
>>> complaining that it can't run a 50-minute job in a 40-minute (max)
>>> coaster worker. ;)
>>>
>>> I mentioned in a prior mail that you need to set the two time values in
>>> your sites.xml entry; that's what you need to do next.
>>>
>>> Change the coaster time in your sites.xml to:
>>> <profile namespace="globus"
>>> key="coasterWorkerMaxwalltime">00:51:00</profile>
>>>
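The mismatch can be checked by hand before submitting. Below is a minimal sketch, assuming the check is a plain comparison of the job's maxwalltime (minutes, from tc.data) against coasterWorkerMaxwalltime (HH:MM:SS, from sites.xml). The function names are made up for illustration and are not part of Swift; the real check inside the coaster code may account for overhead that this ignores.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string such as "00:40:00" to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def job_fits_worker(job_maxwalltime_min: int, worker_maxwalltime_hms: str) -> bool:
    """Return True if the job's walltime request fits inside one coaster worker.

    Assumption: a simple "job <= worker" comparison, with no startup overhead.
    """
    return job_maxwalltime_min * 60 <= hms_to_seconds(worker_maxwalltime_hms)


# Settings from the failed run: 50-minute job, 40-minute worker (3000 s vs 2400 s).
print(job_fits_worker(50, "00:40:00"))
# Suggested fix: 51-minute worker (3000 s vs 3060 s).
print(job_fits_worker(50, "00:51:00"))
```

The 3000 and 2400 figures are the same ones that appear in the log message quoted above.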
>>> If you have more info on the variability of your ptmap run times, send
>>> that to the list, and we can discuss how to handle.
>>>
>>>
>>> (NOTE: running grep -i on the log for "except", or searching for "except"
>>> with an editor, will often locate the first "exception" that your job
>>> encountered. That's how I found the error above.)
>>>
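For anyone who would rather script the log search than use grep or an editor, here is a rough sketch of the same idea: a case-insensitive scan for "except" that stops at the first hit. The helper name is hypothetical, and the first sample line is a made-up placeholder; only the second sample line is taken (shortened) from the log excerpt earlier in this thread.

```python
from typing import Iterable, Optional, Tuple


def first_exception_line(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (line_number, text) of the first line containing "except".

    The match ignores case, like grep -i; returns None if nothing matches.
    """
    for number, line in enumerate(lines, start=1):
        if "except" in line.lower():
            return number, line.rstrip("\n")
    return None


sample_log = [
    "placeholder line with nothing of interest (hypothetical)\n",
    "2009-04-30 14:29:41,265-0500 DEBUG vdl:execute2 APPLICATION_EXCEPTION "
    "jobid=PTMap2-abeii5aj - Application exception: Job cannot be run with "
    "the given max walltime worker constraint\n",
]

print(first_exception_line(sample_log))
```

To run it on a real log, pass an open file object instead of the list, since a file also iterates line by line.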
>>> Also, Yue, for testing new sites, or for validating that old sites still
>>> work, you should create the smallest possible ptmap workflow - 1 job if
>>> that is possible - and verify that this works. Then say 10 jobs to make
>>> sure scheduling etc is sane. Then, send in your huge jobs.
>>>
>>> With only 1 job, it's easier to spot the errors in the log file.
>>>
>>> - Mike
>>>
>>>
>>> On 4/30/09 2:34 PM, Yue, Chen - BMD wrote:
>>>> Hi Michael,
>>>>
>>>> I ran into the same messages again when I used Ranger:
>>>>
>>>> Progress: Selecting site:146 Stage in:25 Submitting:15 Submitted:821
>>>> Failed but can retry:16
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/l on ranger
>>>> Progress: Selecting site:146 Stage in:3 Submitting:1 Submitted:857
>>>> Failed but can retry:16
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/v on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/b on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/0 on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/a on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/4 on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/8 on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/7 on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/x on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/3 on ranger
>>>> Failed to transfer wrapper log from
>>>> PTMap2-unmod-20090430-1428-v0c5di5c/info/q on ranger
>>>> The log for the search is at :
>>>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1428-v0c5di5c.log
>>>>
>>>> The sites.xml I have is:
>>>>
>>>> <pool handle="ranger">
>>>> <execution provider="coaster"
>>>> url="gatekeeper.ranger.tacc.teragrid.org"
>>>> jobManager="gt2:gt2:SGE"/>
>>>> <gridftp url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" />
>>>> <profile namespace="env"
>>>> key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir</profile>
>>>> <profile namespace="globus" key="project">TG-CCR080022N</profile>
>>>> <profile namespace="globus" key="coastersPerNode">16</profile>
>>>> <profile namespace="globus" key="queue">development</profile>
>>>> <profile namespace="globus"
>>>> key="coasterWorkerMaxwalltime">00:40:00</profile>
>>>> <profile namespace="globus" key="maxwalltime">31</profile>
>>>> <profile namespace="karajan" key="initialScore">50</profile>
>>>> <profile namespace="karajan" key="jobThrottle">10</profile>
>>>> <workdirectory>/work/01164/yuechen/swiftwork</workdirectory>
>>>> </pool>
>>>> The tc.data I have is:
>>>>
>>>> ranger PTMap2
>>>> /share/home/01164/yuechen/PTMap2/PTMap2 INSTALLED
>>>> INTEL32::LINUX globus::maxwalltime=50
>>>>
>>>> I'm using swift 0.9 rc2
>>>>
>>>> Thank you very much for help!
>>>>
>>>> Chen, Yue
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Michael Wilde [mailto:wilde at mcs.anl.gov]
>>>> *Sent:* Thu 4/30/2009 2:05 PM
>>>> *To:* Yue, Chen - BMD
>>>> *Subject:* Re: [Swift-user] Execution error
>>>>
>>>>
>>>>
>>>> On 4/30/09 1:51 PM, Yue, Chen - BMD wrote:
>>>> > Hi Michael,
>>>> >
>>>> > When I tried to activate my account, I encountered the following
>>> error:
>>>> >
>>>> > "Sorry, this account is in an invalid state. You may not activate
>>> your
>>>> > at this time."
>>>> >
>>>> > I used the username and password from TG-CDA070002T. Should I use a
>>>> > different password?
>>>>
>>>> If you can already login to Ranger, then you are all set - you must have
>>>> done this previously.
>>>>
>>>> I thought you had *not*, because when I looked up your login on ranger
>>>> ("finger yuechen") it said "never logged in". But seems like that info
>>>> is incorrect.
>>>>
>>>> If you have ptmap compiled, seems like you are almost all set.
>>>>
>>>> Let me know if it works.
>>>>
>>>> - Mike
>>>>
>>>> > Thanks!
>>>> >
>>>> > Chen, Yue
>>>> >
>>>> >
>>>> >
>>> ------------------------------------------------------------------------
>>>> > *From:* Michael Wilde [mailto:wilde at mcs.anl.gov]
>>>> > *Sent:* Thu 4/30/2009 1:07 PM
>>>> > *To:* Yue, Chen - BMD
>>>> > *Cc:* swift user
>>>> > *Subject:* Re: [Swift-user] Execution error
>>>> >
>>>> > Yue, use this XML pool element to access ranger:
>>>> >
>>>> > <pool handle="ranger">
>>>> > <execution provider="coaster"
>>>> > url="gatekeeper.ranger.tacc.teragrid.org"
>>>> > jobManager="gt2:gt2:SGE"/>
>>>> > <gridftp
>>> url="gsiftp://gridftp.ranger.tacc.teragrid.org:2811/" />
>>>> > <profile namespace="env"
>>>> > key="SWIFT_JOBDIR_PATH">/tmp/yuechen/jobdir</profile>
>>>> > <profile namespace="globus"
>>> key="project">TG-CCR080022N</profile>
>>>> > <profile namespace="globus" key="coastersPerNode">16</profile>
>>>> > <profile namespace="globus" key="queue">development</profile>
>>>> > <profile namespace="globus"
>>>> > key="coasterWorkerMaxwalltime">00:40:00</profile>
>>>> > <profile namespace="globus" key="maxwalltime">31</profile>
>>>> > <profile namespace="karajan" key="initialScore">50</profile>
>>>> > <profile namespace="karajan" key="jobThrottle">10</profile>
>>>> > <workdirectory>/work/00306/tg455797/swiftwork</workdirectory>
>>>> > </pool>
>>>> >
>>>> >
>>>> > You will need to also do these steps:
>>>> >
>>>> > Go to this web page to enable your Ranger account:
>>>> >
>>>> > https://tas.tacc.utexas.edu/TASMigration/AccountActivation.aspx
>>>> >
>>>> > Then login to Ranger via the TeraGrid portal and put your ssh keys in
>>>> > place (assuming you use ssh keys, which you should)
>>>> >
>>>> > While on Ranger, do this:
>>>> >
>>>> > echo $WORK
>>>> > mkdir $WORK/swiftwork
>>>> >
>>>> > and put the full path of your $WORK/swiftwork directory in the
>>>> > <workdirectory> element above. (My login is tg455etc, yours is
>>> yuechen)
>>>> >
>>>> > Then scp your code to Ranger and compile it.
>>>> >
>>>> > Then create a tc.data entry for your ptmap app
>>>> >
>>>> > Next, set your time values in the sites.xml entry above to suitable
>>>> > values for Ranger. You'll need to measure times, but I think you will
>>>> > find Ranger about twice as fast as Mercury for CPU-bound jobs.
>>>> >
>>>> > The values above were set for one app job per coaster. I think
>>> you can
>>>> > probably do more.
>>>> >
>>>> > If you estimate a run time of 5 minutes, use:
>>>> >
>>>> > <profile namespace="globus"
>>>> > key="coasterWorkerMaxwalltime">00:30:00</profile>
>>>> > <profile namespace="globus" key="maxwalltime">5</profile>
>>>> >
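As a rough way to see how the two time values relate, the sketch below estimates how many back-to-back app jobs fit in one coaster worker's lifetime. It assumes the only limits are the two walltime values and ignores queue wait, worker startup, and staging, so treat the result as an upper bound. The function names are illustrative, not Swift settings.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string such as "00:30:00" to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def jobs_per_worker(job_maxwalltime_min: int, worker_maxwalltime_hms: str) -> int:
    """Upper bound on sequential jobs per worker, assuming zero overhead."""
    return hms_to_seconds(worker_maxwalltime_hms) // (job_maxwalltime_min * 60)


# First sites.xml entry in this thread: 31-minute jobs in a 40-minute worker.
print(jobs_per_worker(31, "00:40:00"))
# The 5-minute estimate: 5-minute jobs in a 30-minute worker.
print(jobs_per_worker(5, "00:30:00"))
```

With the first pair of values only one job fits per worker, which matches the "one app job per coaster" remark; with the second pair the bound is six.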
>>>> > Other people on the list - please sanity check what I suggest here.
>>>> >
>>>> > - Mike
>>>> >
>>>> >
>>>> > On 4/30/09 12:40 PM, Michael Wilde wrote:
>>>> > > I just checked - TG-CDA070002T has indeed expired.
>>>> > >
>>>> > > The best for now is to move to use (only) Ranger, under this
>>> account:
>>>> > > TG-CCR080022N
>>>> > >
>>>> > > I will locate and send you a sites.xml entry in a moment.
>>>> > >
>>>> > > You need to go to a web page to activate your Ranger login.
>>>> > >
>>>> > > Best to contact me in IM and we can work this out.
>>>> > >
>>>> > > - Mike
>>>> > >
>>>> > >
>>>> > >
>>>> > > On 4/30/09 12:23 PM, Michael Wilde wrote:
>>>> > >> Also, what account are you running under? We may need to change
>>>> you to
>>>> > >> a new account - as the OSG Training account expires today.
>>>> > >> If that happened at noon, it *might* be the problem.
>>>> > >>
>>>> > >> - Mike
>>>> > >>
>>>> > >>
>>>> > >> On 4/30/09 12:08 PM, Yue, Chen - BMD wrote:
>>>> > >>> Hi,
>>>> > >>>
>>>> > >>> I came back to re-run my application on NCSA Mercury which was
>>>> tested
>>>> > >>> successfully last week after I just set up coasters with
>>> swift 0.9,
>>>> > >>> but I got many messages like the following:
>>>> > >>>
>>>> > >>> Progress: Stage in:219 Submitting:803 Submitted:1
>>>> > >>> Progress: Stage in:129 Submitting:703 Submitted:190 Failed
>>>> but can
>>>> > >>> retry:1
>>>> > >>> Progress: Stage in:38 Submitting:425 Submitted:556 Failed
>>> but can
>>>> > >>> retry:4
>>>> > >>> Failed to transfer wrapper log from
>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/h on NCSA_MERCURY
>>>> > >>> Failed to transfer wrapper log from
>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/j on NCSA_MERCURY
>>>> > >>> Failed to transfer wrapper log from
>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/p on NCSA_MERCURY
>>>> > >>> Failed to transfer wrapper log from
>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/1 on NCSA_MERCURY
>>>> > >>> Failed to transfer wrapper log from
>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/b on NCSA_MERCURY
>>>> > >>> Failed to transfer wrapper log from
>>>> > >>> PTMap2-unmod-20090430-1203-r19dxq10/info/c on NCSA_MERCURY
>>>> > >>> Progress: Stage in:1 Submitted:1013 Active:1 Failed but can
>>>> retry:8
>>>> > >>> Progress: Submitted:1011 Active:1 Failed but can retry:11
>>>> > >>> The log file for the successful run last week is:
>>>> > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090422-1216-4s3037gf.log
>>>> > >>>
>>>> > >>> The log file for the failed run is :
>>>> > >>> /home/yuechen/PTMap2/PTMap2-unmod-20090430-1151-rf2uuhb7.log
>>>> > >>>
>>>> > >>> I don't think I did anything different, so I don't know why this
>>>> time
>>>> > >>> they failed. The sites.xml for Mercury is:
>>>> > >>>
>>>> > >>> <pool handle="NCSA_MERCURY">
>>>> > >>> <gridftp url="gsiftp://gridftp-hg.ncsa.teragrid.org"/>
>>>> > >>> <execution provider="coaster"
>>> url="grid-hg.ncsa.teragrid.org"
>>>> > >>> jobManager="gt2:PBS"/>
>>>> > >>>
>>> <workdirectory>/gpfs_scratch1/yuechen/swiftwork</workdirectory>
>>>> > >>> <profile namespace="globus" key="queue">debug</profile>
>>>> > >>> </pool>
>>>> > >>>
>>>> > >>> Thank you for help!
>>>> > >>>
>>>> > >>> Chen, Yue
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> > >>> This email is intended only for the use of the individual or
>>> entity
>>>> > >>> to which it is addressed and may contain information that is
>>>> > >>> privileged and confidential. If the reader of this email
>>> message is
>>>> > >>> not the intended recipient, you are hereby notified that any
>>>> > >>> dissemination, distribution, or copying of this communication is
>>>> > >>> prohibited. If you have received this email in error, please
>>> notify
>>>> > >>> the sender and destroy/delete all copies of the transmittal.
>>>> Thank you.
>>>> > >>>
>>>> > >>>
>>>> > >>>
>>>> >
>>> ------------------------------------------------------------------------
>>>> > >>>
>>>> > >>> _______________________________________________
>>>> > >>> Swift-user mailing list
>>>> > >>> Swift-user at ci.uchicago.edu
>>>> > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>>> >
>>>> >
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Swift-devel mailing list
>>> Swift-devel at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>