From vipulkrsingh at gmail.com Thu Apr 1 12:25:05 2010
From: vipulkrsingh at gmail.com (Vipul Kumar Singh)
Date: Thu, 1 Apr 2010 22:55:05 +0530
Subject: [Swift-devel] Re: scheduling
In-Reply-To: <1270058003.19795.25.camel@localhost>
References: <4745903.473531270038507652.JavaMail.root@zimbra> <1270052164.16103.10.camel@localhost> <1270058003.19795.25.camel@localhost>
Message-ID:

Please take a look at this draft of my GSoC project proposal. I look forward to your suggestions.

ABSTRACT: The aim is to develop a data-site catalog containing the logical names of data files and the sites on which those files are available. Using the catalog, the existing weighted scheduler can add a bias toward particular sites. This will help the scheduler choose sites based on data availability and reduce the data transfers required.

*1. Provide a 1-2 paragraph summary of the project you propose to do over the summer.*

The main purpose of the project is to make site selection aware of data placement, so that the scheduler prefers sites that already hold a job's input files and less data needs to be transferred. To achieve this, a catalog will be implemented that stores up-to-date details about data files and the sites they reside on. The scheduler consults the catalog and tries to schedule jobs on sites with access to the required data.

*2. What Globus project (see list in http://dev.globus.org/) does your GSoC project most closely relate to?*

The project is related to the Swift incubator project.

*3. Have you contacted a Globus mentor about this project proposal? If so, who?*

Yes. I approached mentor Michael Wilde on the swift-devel mailing list regarding the project, and I also got valuable suggestions from Mihael Hategan and Ben Clifford.

*4. What languages, libraries, toolkits, etc. will you use for this project? 
If part of the project will require researching technologies to decide which one is better suited, just say so (do mention what technologies you will be looking at, if you already know this)*

The coding will be mostly in Java.

*5. What would be the main deliverables for your project? Please include a rough timeline for these deliverables. We are not asking you to commit to specific dates right now, and you can certainly tweak the deliverables later on (in fact, we expect you will do so as you interact more with your mentor and the Globus community). However, please give us an approximate idea of what you expect to produce throughout the summer.*

Deliverables:
+ A replica catalog mapping logical file names to the sites on which they reside.
+ A modified scheduler whose scheduling decisions are influenced by the catalog.

Rough timeline:
+ First phase (up to 30 May): decide the specification of the catalog and design an API for it, so that existing implementations such as Globus RLS can be plugged in.
+ Second phase (up to 30 June): implement the catalog.
+ Third phase (up to 9 August): modify the scheduler to take the catalog into account, then test and scrutinize it thoroughly.

*6. What are your qualifications for this project? Please let us know what previous experience you have with the technologies you listed in question (3). Take into account that having limited knowledge on the Globus Toolkit does not disqualify you from participating; GSoC is as much about learning as it is about writing code, and you will have until the summer to get up to speed.*

I am an undergraduate student in Computer Science engineering. I have a good understanding of object-oriented programming languages such as Java and C++. I have experience working with the GridSim toolkit, which is implemented in Java.

*7.
If you have little or no experience with Globus technologies, or any other technology involved in your project, will you be able to use the "Community Bonding Period" (April 20 - May 23) to get up to speed?*

Yes. I will use the community bonding period to understand how the scheduler and mapper work in Swift, and to discuss the logical issues regarding the catalog and its API. I will also use the interim period to study how the Globus RLS catalog and other similar implementations work.

*8. Will you have any other commitments during the summer? In particular, let us know if your school year ends later than May 23 (i.e., if you will still be doing final exams when GSoC starts) and if you are already committed to another job (an internship, a teaching/research assistantship at your university, etc.). This does not disqualify you from participating but you have to be upfront about how much time you'll be able to spend on your GSoC project.*

No, I have no other commitments.

*9. Please provide a contact e-mail address in case we need to discuss your project proposal further with you (the contact details you provide to Google will not be shared with us, so we need you to include them as part of your application too).*

vipulkrsingh at gmail.com

*10. If you want to provide any additional details about your project, please do so here:*

Thank you,
Vipul Kumar Singh
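As a rough illustration of the proposal's central idea, the sketch below keeps a catalog from logical file names to the sites holding replicas, and adds a per-file bonus to each site's base score so that sites already holding a job's inputs are preferred. It is a hypothetical sketch only: DataAwareSiteSelector, its methods, and the linear bonusPerFile weighting are assumptions made for illustration, not the interface of Swift's weighted scheduler or of Globus RLS.

```java
import java.util.*;

// Hypothetical sketch: a replica catalog (logical file name -> sites holding it)
// used to add a data-locality bias to per-site scheduler scores.
public class DataAwareSiteSelector {
    // Logical file name -> sites where a replica is known to exist.
    private final Map<String, Set<String>> catalog = new HashMap<>();

    // Record that a replica of 'lfn' is available on 'site'.
    public void register(String lfn, String site) {
        catalog.computeIfAbsent(lfn, k -> new HashSet<>()).add(site);
    }

    // Number of the job's input files already present on 'site'.
    public int localInputs(String site, Collection<String> inputs) {
        int n = 0;
        for (String lfn : inputs) {
            if (catalog.getOrDefault(lfn, Collections.emptySet()).contains(site)) {
                n++;
            }
        }
        return n;
    }

    // Base score (e.g. from an existing weighted scheduler) plus an assumed
    // linear bonus for every input file that would not need a transfer.
    public double biasedScore(String site, double baseScore,
                              Collection<String> inputs, double bonusPerFile) {
        return baseScore + bonusPerFile * localInputs(site, inputs);
    }

    // Pick the site with the highest biased score (first one wins on ties).
    public String choose(Map<String, Double> baseScores,
                         Collection<String> inputs, double bonusPerFile) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : baseScores.entrySet()) {
            double s = biasedScore(e.getKey(), e.getValue(), inputs, bonusPerFile);
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        DataAwareSiteSelector sel = new DataAwareSiteSelector();
        sel.register("input1.dat", "siteB");
        sel.register("input2.dat", "siteB");
        Map<String, Double> base = new LinkedHashMap<>();
        base.put("siteA", 10.0);
        base.put("siteB", 8.0);
        // siteB scores 8 + 2 * 2 = 12, beating siteA's 10, so this prints siteB.
        System.out.println(sel.choose(base, List.of("input1.dat", "input2.dat"), 2.0));
    }
}
```

A linear bonus is the simplest possible bias; weighting by file size or estimated transfer time would be natural refinements.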
From wilde at mcs.anl.gov Thu Apr 1 23:44:00 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 1 Apr 2010 23:44:00 -0500 (CDT)
Subject: [Swift-devel] Article on Swift in SciDAC Review
Message-ID: <10961150.1131270183440578.JavaMail.root@zimbra>

This just came out: http://www.scidacreview.org/1002/html/swift.html

- Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From b_bprimal at hotmail.com Sat Apr 3 18:01:21 2010
From: b_bprimal at hotmail.com (Bhaskar Prasad Rimal)
Date: Sat, 3 Apr 2010 23:01:21 +0000
Subject: [Swift-devel] Workflow Scheduling
Message-ID:

Dear All,

I want to do my thesis on workflow-based scheduling in a Cloud Computing environment. A number of randomly generated jobs (in directed acyclic graph form) will be submitted and scheduled according to the proposed policy and scheduling algorithms; I will then measure execution time, throughput and fairness, and compare the results with other approaches.

For the simulation work, which simulator (e.g. CloudSim, Swift, SimGrid, GridSim) would be suitable for this thesis? Could you suggest one?

Thank you so much for your kind cooperation.

Regards,
Bhaskar

From vipulkrsingh at gmail.com Sat Apr 3 18:56:27 2010
From: vipulkrsingh at gmail.com (Vipul Kumar Singh)
Date: Sun, 4 Apr 2010 05:26:27 +0530
Subject: [Swift-devel] hi
Message-ID:

Hi,

I am trying this little experiment.
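A minimal Java sketch of the reference-counted shared-directory scheme described in the notes below. Assumptions: the class and method names are hypothetical; copies, links and deletions are only recorded as strings rather than performed on a filesystem; and deleting the shared copy once its last user finishes is inferred from the stated goal of keeping files in shared_directory only while a job needs them. The synchronized methods protect the counts within one JVM only; runs from different users would need locking at the site itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reference-counted stage-in through a per-site shared directory.
// Filesystem operations are only recorded as strings, not performed.
public class SharedDirCache {
    // One record of the "FileName - PathWhereFileIsStored - Count" list.
    private static final class Entry {
        final String path; // location of the single shared copy
        int count;         // number of jobs currently linking to it

        Entry(String path) {
            this.path = path;
        }
    }

    private final String sharedDir;
    private final Map<String, Entry> entries = new HashMap<>();
    private final List<String> actions = new ArrayList<>();

    public SharedDirCache(String sharedDir) {
        this.sharedDir = sharedDir;
    }

    // Stage-in of one file for one job: copy only on first use, always link.
    public synchronized void stageIn(String fileName, String jobTmpDir) {
        Entry e = entries.get(fileName);
        if (e == null) {
            e = new Entry(sharedDir + "/" + fileName);
            actions.add("copy " + fileName + " -> " + e.path);
            entries.put(fileName, e);
        }
        actions.add("link " + jobTmpDir + "/" + fileName + " -> " + e.path);
        e.count++;
    }

    // Stage-out of one file for one job. Returns true when the last user is gone
    // and the record (and, by assumption, the shared copy) is removed.
    public synchronized boolean release(String fileName) {
        Entry e = entries.get(fileName);
        if (e == null) {
            throw new IllegalStateException("not staged in: " + fileName);
        }
        if (e.count == 1) {
            entries.remove(fileName);
            actions.add("delete " + e.path);
            return true;
        }
        e.count--;
        return false;
    }

    public synchronized int count(String fileName) {
        Entry e = entries.get(fileName);
        return e == null ? 0 : e.count;
    }

    public synchronized List<String> actions() {
        return new ArrayList<>(actions);
    }

    public static void main(String[] args) {
        SharedDirCache cache = new SharedDirCache("/shared");
        cache.stageIn("a.dat", "/tmp/job1"); // first user: copy + link
        cache.stageIn("a.dat", "/tmp/job2"); // second user: link only
        cache.release("a.dat");              // COUNT 2 -> 1
        cache.release("a.dat");              // COUNT was 1: record removed
        cache.actions().forEach(System.out::println);
    }
}
```

A plain map with explicit counts matches the List described in the notes; an LRU cache would be a different policy, keeping unused files around until space is needed rather than deleting them as soon as the count drops to zero.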
Note 1) A List is maintained that contains FileName - PathWhereFileIsStored - Count (when the count reaches zero, the record is removed from the List).
Note 2) Every site has a shared_directory.

For all the stage-in files required for a job DO
    check if the file is in the List
    IF the file is in the List THEN
        do NOT copy the file; instead, create a link in this run's temporary directory to the file in the shared directory, AND increment COUNT
    ELSE (the file is NOT in the List)
        copy the file to the shared directory, create the link, create an entry in the List and set COUNT to 1

During stage-out for a job
    IF the COUNT value of the file is 1
        remove the file from the List
    ELSE
        decrement COUNT for the file

I still don't understand a large part of the workflow. Going through the code, I am not able to figure out where exactly the files to be staged in for a particular job are determined.

The above scheme requires the data files to be present in shared_directory only while a job that needs those files is being executed. It should work best when a series of jobs depending on the same files is submitted to the scheduler. Though this experiment is not very useful in this form, I will try running several SwiftScripts from different terminals simultaneously to see whether the logic works, and later try to extend it to the case where the SwiftScripts are run by different users.

And instead of a List, can we use LRUFileCache?

Thank you,
Vipul Kumar Singh

From wilde at mcs.anl.gov Sun Apr 4 13:17:58 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 4 Apr 2010 13:17:58 -0500 (CDT)
Subject: [Swift-devel] Workflow Scheduling
In-Reply-To:
Message-ID: <11626403.41721270405078068.JavaMail.root@zimbra>

Hi Bhaskar,

Swift is a scripting language which you could use to actually run cloud-based workflows.
But it's not a simulation language, in that it does only real execution and doesn't perform any simulation-based modeling or mathematical calculation of simulated quantities like run times or queue lengths.

Conceivably, one could use Swift to evaluate some lightweight simulated approaches (e.g., by testing with "dummy" jobs, or by running many simulated dummy jobs on each remote compute core to simulate a much larger resource pool), and then get some statistics from Swift's log plot generator. Whether this is of use for your research would need to be examined in much greater depth, compared to using a true simulator.

I don't know anything about the other tools you mention, so I can't comment on them. Perhaps others on this list can.

Regards,

Mike

----- "Bhaskar Prasad Rimal" wrote:
> Dear All,
>
> I want to do my thesis on Workflow based scheduling on Cloud Computing
> environment. Number of random generated jobs (direct acyclic graphs
> form) will be summitted and schedule according to proposed policy and
> scheduling algorithms and measure execution time, throughput, fairness
> and compare with other approach.
>
> For the simulation work, which simulator (like CloudSim, Swift,
> SimGrid, GridSim etc) is suitable for this thesis, Could you suggest
> me.
>
>
> Thank you so much for your kind cooperation.
>
> Regards
>
> Bhaskar
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From vipulkrsingh at gmail.com Sun Apr 4 17:16:06 2010
From: vipulkrsingh at gmail.com (Vipul Kumar Singh)
Date: Mon, 5 Apr 2010 03:46:06 +0530
Subject: [Swift-devel] Workflow Scheduling
In-Reply-To: <11626403.41721270405078068.JavaMail.root@zimbra>
References: <11626403.41721270405078068.JavaMail.root@zimbra>
Message-ID:

Hi,

I have used GridSim before. It's fairly easy to create new scheduling algorithms and test their performance. It provides many examples of simple schedulers; one can start by modifying them and gradually build upon them, so it is quick to get started.

For performance measurement, the toolkit generates a lot of log files at the user, resource and job levels. For cases where these logs were not sufficient, I wrote data during execution into custom log files in CSV format and later generated graphs from them.

The toolkit provides a lot of scope for creating simulation environments, and various scenarios can be simulated without much effort.

Thanks,
Vipul

On Sun, Apr 4, 2010 at 11:47 PM, Michael Wilde wrote:
> Hi Bhaskar,
>
> Swift is a scripting language which you could use to actually run
> cloud-based workflows.
>
> But its not a simulation language, in that it does only real execution and
> doesn't perform any simulation-based modeling or mathematical calculations
> of simulated quantities like run times or queue lengths.
>
> Conceivably, one could use Swift to evaluate some light-weight simulated
> approaches (ie, by testing with "dummy" jobs or by running many simulated
> dummy jobs on each remote compute core to simulate a much larger resource
> pool). And then get some statistics from Swift's log plot generator.
> Whether this is of use for your research would need to be examined in much greater
> depth, compared to using a true simulator.
>
> I don't know anything about the other tools you mention, so I cant comment
> on them. Perhaps others in this list can.
>
> Regards,
>
> Mike
>
> ----- "Bhaskar Prasad Rimal" wrote:
>
> > Dear All,
> >
> > I want to do my thesis on Workflow based scheduling on Cloud Computing
> > environment. Number of random generated jobs (direct acyclic graphs
> > form) will be summitted and schedule according to proposed policy and
> > scheduling algorithms and measure execution time, throughput, fairness
> > and compare with other approach.
> >
> > For the simulation work, which simulator (like CloudSim, Swift,
> > SimGrid, GridSim etc) is suitable for this thesis, Could you suggest
> > me.
> >
> >
> > Thank you so much for your kind cooperation.
> >
> > Regards
> >
> > Bhaskar
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

From wilde at mcs.anl.gov Tue Apr 6 09:29:34 2010
From: wilde at mcs.anl.gov (wilde at mcs.anl.gov)
Date: Tue, 6 Apr 2010 09:29:34 -0500 (CDT)
Subject: [Swift-devel] Re: [Swift-user] swift and fusion
In-Reply-To: <20059599.84051270563655816.JavaMail.root@zimbra>
Message-ID: <12655678.84291270564174420.JavaMail.root@zimbra>

Thanks, Marcin - that is very helpful. I'm cc'ing to Swift-Devel.
It looks like the regular queue behaves like TeraGrid (e.g., Abe and Queenbee) and the shared queue like PADS and TeraPort.

We need to get back to you with more info, but the following sites.xml should (approximately?) work for you on the Fusion regular queue:

    3600 8 8 1 4 1.27 10000 regular /scratch/local/$USER $HOME/swiftwork

Notes:
- set maxWallTime in tc.data for your job
- set maxTime above to be the largest wall time Swift should request from PBS
- leave out scratch in your first run; later, set it to a large local-disk directory on the worker nodes, which is likely documented in the Fusion doc pages, or which you can find by logging into a node with qsub -I and exploring with the df and mount commands
- set your workdirectory as you do in your current sites.xml
- you may want/need to adjust the various throttles based on experience and your run profiles.

We can discuss more as needed.

Mike

----- "Marcin Hitczenko" wrote:
> Hi Michael,
>
> Here is a fragment of an email I was sent from the people running
> fusion.
> I am not sure if it is sufficient for finding out what you want (it
> kind of sounds like it assigns jobs to hosts, but I am of course not
> sure at all). If this is the case, then I would like to learn to use
> coasters.
>
> Best,
>
> Marcin
>
> Finally, please note that time on fusion will be charged by
> core-hour. Since there are 8 cores on each node in fusion, and
> since each node is dedicated to a single job (except for the nodes
> in the shared queue and other special cases), this means a 1 hour
> job running on 1 node will be charged 8 core-hours.
>
> 2) Jobs are now prioritized in the job queue based on the priority
> assigned to the job's project.
>
> 3) Since jobs on the regular nodes have exclusive access to all the
> cores (processors) on the nodes, we've configured the resource
> manager (pbs) to automatically add the "ppn=8" (process(or)s per
> node) property to the count of nodes given to qsub.
> For example,
> if you submit a job requesting 32 nodes with "qsub -l nodes=32",
> the resource manager will convert this to "qsub -l nodes=32:ppn=8"
> to reflect the fact that the job has access to all 256 cores on the
> allocated nodes. If you request less processors per node (by, eg,
> using "ppn=4"), the resource manager will change the request to
> ppn=8.
>
> A consequence of having "ppn=8" added to the node count is that the
> $PBS_NODEFILE created for your job will list each node assigned to
> your job 8 times. The default (Hydra) mpiexec used on fusion for
> running MPI jobs will see this and automatically use all the cores
> on all the nodes assigned to your job. Other applications which
> also use $PBS_NODEFILE should also use all the cores on the nodes.
> If your jobs script counts the number of lines in $PBS_NODEFILE to
> determine the number of nodes, this will actually yield the number
> of processors; instead, you need to count the number of unique
> nodes listed in $PBS_NODEFILE in order to determine the number of
> nodes. For example, using Bourne-shell syntax, the lines
>
>     nprocs=`wc -l < $PBS_NODEFILE`
>     nnodes=`sort -u $PBS_NODEFILE | wc -l`
>
> will set $nprocs to the number of processors assigned to your job
> and $nnodes to the number of nodes assigned to your job (for csh,
> simply add "set " to the start of each of these lines).
>
> Jobs submitted to the shared queue won't have the ppn=8 property
> added, nor will a ppn property submitted with the job be changed.
>
> > Hi Marcin,
> >
> > We need to do a little research here and then document the findings.
> > The issue is whether Fusion assigns jobs to cores (like TeraPort and
> > PADS) or hosts (like most TeraGrid PBS sites seem to).
> >
> > If Fusion schedules by-core, then PBS will just place your jobs on
> > free cores, which may be on the same host (as the cluster gets filled)
> > or may not.
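The processor/node distinction in the Fusion notes comes down to counting all lines versus distinct lines of $PBS_NODEFILE, and the regular-queue charge follows from the node count (8 cores per node, whole nodes charged). A small Java sketch of that arithmetic, assuming the nodefile contents are already available as a list of hostnames; the class and method names are illustrative only, and the 8-core figure comes from the Fusion notes and would differ on other clusters.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Sketch: processor and node counts from the lines of a PBS nodefile, plus the
// core-hours charged when whole 8-core nodes are billed (Fusion regular queue).
public class NodefileCounts {
    static final int CORES_PER_NODE = 8; // taken from the Fusion notes

    // Same as `wc -l < $PBS_NODEFILE`: one line per allocated processor.
    static int processors(List<String> nodefileLines) {
        return nodefileLines.size();
    }

    // Same as `sort -u $PBS_NODEFILE | wc -l`: number of distinct hostnames.
    static int nodes(List<String> nodefileLines) {
        return new HashSet<>(nodefileLines).size();
    }

    // Whole nodes are charged, however many of their cores the job uses.
    static double coreHoursCharged(int nodes, double wallHours) {
        return nodes * CORES_PER_NODE * wallHours;
    }

    public static void main(String[] args) {
        // Simulated nodefile for a 2-node job: each host listed 8 times.
        List<String> lines = new ArrayList<>();
        for (String host : List.of("node01", "node02")) {
            for (int i = 0; i < CORES_PER_NODE; i++) {
                lines.add(host);
            }
        }
        // prints: 16 processors on 2 nodes
        System.out.println(processors(lines) + " processors on " + nodes(lines) + " nodes");
        // prints: 8.0 core-hours for 1 node x 1 hour
        System.out.println(coreHoursCharged(1, 1.0) + " core-hours for 1 node x 1 hour");
    }
}
```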
> >
> > If it schedules by-host, then you'll need to use coasters to avoid
> > wasting hosts. Mihael: the last I recall, you were uncertain how to
> > determine this. Is it still an open issue? It seems we need to document
> > a query one can pose to PBS to determine how its configured.
> >
> > For sites that schedule by-host, the only convenient way to use all
> > cores on a host is to use coasters and specify workersPerNode in
> > sites.xml.
> >
> > There is another method worked OK for uniform-length jobs, which
> > involves job clustering and a tiny mod to the swift clustering script
> > to run all the jobs in a cluster in parallel. That was used effectively
> > on a TeraGrid site, but I'd use it only as a last resort.
> >
> > - Mike
> >
> > ----- "Marcin Hitczenko" wrote:
> >
> >> Hi,
> >>
> >> I am using swift to submit several R jobs on fusion and am trying to
> >> determine whether or not I am making use of all the available cores
> >> on each node (I believe there are 8 cores on each node). I submitted
> >> 10 really short jobs to try to see if I could determine what was
> >> going on, but I don't really know what to look for.
> >>
> >> In case it is useful, I am attaching the .log file, sites.xml, and
> >> an info file for one of the jobs.
> >>
> >> Thanks for your help,
> >>
> >> Marcin
> >> _______________________________________________
> >> Swift-user mailing list
> >> Swift-user at ci.uchicago.edu
> >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
>

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From benc at hawaga.org.uk Wed Apr 7 10:37:54 2010
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 7 Apr 2010 15:37:54 +0000 (GMT)
Subject: [Swift-devel] swift vs opm-collections
In-Reply-To:
References: <4BB0B748.2010105@lncc.br>
Message-ID:

I started formalising notes on Swift vs OPM Collections for the PC3 report. However, the notes turned into a 4-page document, which you can read here:

http://www.hawaga.org.uk/ben/tech/swift/swift-opm-collections.pdf

The later sections are the least well defined, but the most interesting for Swift.

This is too long to go into the PC3 paper, but if anyone can be bothered to read it and suggest which parts they think are especially relevant, that would be cool.

--
http://www.hawaga.org.uk/ben/

From aespinosa at cs.uchicago.edu Wed Apr 14 02:41:07 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 14 Apr 2010 02:41:07 -0500
Subject: [Swift-devel] Re: [Swift-user] throttling mapping threads
In-Reply-To: <1271223942.7014.8.camel@localhost>
References: <1270877776.14163.1.camel@localhost> <1271106661.2089.8.camel@origin> <1271118497.2089.10.camel@origin> <1271223942.7014.8.camel@localhost>
Message-ID:

Moving the thread to swift-devel.

Ah, you're right.
I did a thread trace in my old setup:

$ jstack 8026 | grep "state =" | cat -n
     1  Thread 8051: (state = BLOCKED)
     2  Thread 8050: (state = BLOCKED)
     3  Thread 8047: (state = BLOCKED)
     4  Thread 8045: (state = BLOCKED)
     5  Thread 8043: (state = BLOCKED)
     6  Thread 8042: (state = BLOCKED)
     7  Thread 8041: (state = BLOCKED)
     8  Thread 8040: (state = BLOCKED)
     9  Thread 8034: (state = BLOCKED)
    10  Thread 8033: (state = BLOCKED)
    11  Thread 8032: (state = BLOCKED)
    12  Thread 8026: (state = BLOCKED)

Also, I added a line to my ext mapper that creates a file. 13 minutes after I started the workflow, only 12 files had been created. I'll check the new version's swift.log to see how long these files are expected to take to appear.

-Allan

2010/4/14 Mihael Hategan :
> So I looked at the problem in a bit more detail.
>
> The part that starts the external process is run from the karajan worker
> threads which are limited in number (somewhere between 1 and 8)*. So
> that's the maximum number of concurrent invocations of an external
> mapper. Do you actually see all 400 of them run at once?
>
> Mihael
>
> (*) Funny thing that for cooperative multi-tasking such non-cooperating
> tasks as the example above were the thing that prompted the development
> of preemptive multi-tasking. And yet here's an example where, due to
> lazy coding, it turns out to be helpful.
>
> On Tue, 2010-04-13 at 11:52 -0500, Allan Espinosa wrote:
>> oh i was referring to my ext mapper script
>> 1 swift script invoked once, inside of it is a foreach doing 400+ ext mappings.
>>
>> -Allan
>>
>> 2010/4/13 Ben Clifford :
>> >> one for the whole script which is invoked a lot of times
>> >
>> >
>> > you're invoking the same swiftscript a whole lot of times?
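Mihael's explanation, that a small fixed pool of Karajan worker threads caps how many external mapper processes can run at once regardless of how many mappings a foreach requests, can be illustrated with a plain java.util.concurrent fixed thread pool. This is not Karajan's implementation: the pool size of 4, the 400 requests and the 2 ms sleep standing in for an external mapper process are assumptions chosen only to show the bound.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: N worker threads bound the number of concurrently running
// "mapper invocations", even when hundreds are requested at once.
public class BoundedMapperDemo {
    // Runs 'requests' simulated mapper calls on 'workers' threads and returns the
    // highest number observed running at the same time.
    static int maxConcurrent(int workers, int requests) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(requests);
        try {
            for (int i = 0; i < requests; i++) {
                pool.execute(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(2); // stands in for running an external mapper process
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                        done.countDown();
                    }
                });
            }
            done.await(); // wait until every queued invocation has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // 400 requested mappings, as in the thread, but never more than 4 at once.
        System.out.println("peak concurrency: " + maxConcurrent(4, 400));
    }
}
```

The extra requests simply wait in the executor's queue, which matches the observation that only a handful of mapper output files appear at a time.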
From aespinosa at cs.uchicago.edu Wed Apr 21 20:56:31 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 21 Apr 2010 20:56:31 -0500
Subject: [Swift-devel] coasters submitting improper gt2 rsl jobs
Message-ID: <20100422015440.GA24449@communicado.ci.uchicago.edu>

Hi,

Where can I get the RSL file when Swift generates a submission to a GRAM2 endpoint?

swift version: r3266
cog version: r2739

attachments: sites.xml, logfile

session output:

Caused by:
        Could not submit job
Caused by:
        Could not start coaster service
Caused by:
        Cannot parse the given RSL
Caused by:
        java.lang.NullPointerException
        at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.notEmpty(JobSubmissionTaskHandler.java:694)
        at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.prepareSpecification(JobSubmissionTaskHandler.java:328)
        at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:87)
        at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
        at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.startService(ServiceManager.java:179)
        at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:120)
        at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:134)
        at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.getChannel(JobSubmissionTaskHandler.java:109)
        at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.getChannel(JobSubmissionTaskHandler.java:109)
        at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:95)
        at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
        at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86)
        at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
        at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
        at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
        at java.lang.Thread.run(Thread.java:595)

Cleaning up...
Done

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sites.xml
Type: text/xml
Size: 874 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: postproc-TEST_firefly1.log.bz2
Type: application/x-bzip2
Size: 313221 bytes
Desc: not available
URL:

From hategan at mcs.anl.gov Wed Apr 21 21:15:28 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 21 Apr 2010 21:15:28 -0500
Subject: [Swift-devel] coasters submitting improper gt2 rsl jobs
In-Reply-To: <20100422015440.GA24449@communicado.ci.uchicago.edu>
References: <20100422015440.GA24449@communicado.ci.uchicago.edu>
Message-ID: <1271902528.31945.22.camel@localhost>

On Wed, 2010-04-21 at 20:56 -0500, Allan Espinosa wrote:
> Hi,
>
> Where can I get the RSL file when swift generates a submission to a GRAM2
> endpoint?

The code is lying. There is no RSL at that stage. I changed that error message.
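The NullPointerException at notEmpty is a short-circuit mistake. With ||, a null l makes "l != null" false, so Java goes on to evaluate l.isEmpty() and throws; a non-null but empty l short-circuits to true. With &&, evaluation stops at the null check and empty input correctly yields false. A self-contained sketch contrasting the two predicates, assuming the argument is a java.util.List (its real type in JobSubmissionTaskHandler is not shown in this thread):

```java
import java.util.ArrayList;
import java.util.List;

// Contrast of the buggy and fixed null/empty checks (argument type assumed to be List).
public class NotEmptyCheck {
    // Buggy: for null, `l != null` is false so `l.isEmpty()` runs and throws;
    // for any non-null list (even an empty one) it short-circuits to true.
    static boolean notEmptyBuggy(List<?> l) {
        return l != null || !l.isEmpty();
    }

    // Fixed: `&&` stops at the null check, and empty lists yield false.
    static boolean notEmpty(List<?> l) {
        return l != null && !l.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(notEmpty(null));                   // false
        System.out.println(notEmpty(new ArrayList<>()));      // false
        System.out.println(notEmpty(List.of("x")));           // true
        System.out.println(notEmptyBuggy(new ArrayList<>())); // true (wrong)
        try {
            notEmptyBuggy(null);
        } catch (NullPointerException e) {
            System.out.println("buggy version throws NPE for null");
        }
    }
}
```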
I also fixed the underlying problem:
return l != null || !l.isEmpty(); -> return l != null && !l.isEmpty();

Mihael

>
> swift version: r3266
> cog version: r2739
>
> attachments: sites.xml, logfile
>
> session output:
>
> Caused by:
>         Could not submit job
> Caused by:
>         Could not start coaster service
> Caused by:
>         Cannot parse the given RSL
> Caused by:
>         java.lang.NullPointerException
>         at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.notEmpty(JobSubmissionTaskHandler.java:694)
>         at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.prepareSpecification(JobSubmissionTaskHandler.java:328)
>         at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:87)
>         at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
>         at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.startService(ServiceManager.java:179)
>         at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:120)
>         at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:134)
>         at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.getChannel(JobSubmissionTaskHandler.java:109)
>         at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.getChannel(JobSubmissionTaskHandler.java:109)
>         at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:95)
>         at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
>         at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86)
>         at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
>         at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
>         at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
>         at java.lang.Thread.run(Thread.java:595)
>
> Cleaning up...
> Done
>

From aespinosa at cs.uchicago.edu Wed Apr 21 21:18:45 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 21 Apr 2010 21:18:45 -0500
Subject: [Swift-devel] coasters submitting improper gt2 rsl jobs
In-Reply-To: <1271902528.31945.22.camel@localhost>
References: <20100422015440.GA24449@communicado.ci.uchicago.edu> <1271902528.31945.22.camel@localhost>
Message-ID:

Cool. So I just update my tree then?

Thanks!

-Allan

2010/4/21 Mihael Hategan :
> On Wed, 2010-04-21 at 20:56 -0500, Allan Espinosa wrote:
>> Hi,
>>
>> Where can I get the RSL file when swift generates a submission to a GRAM2
>> endpoint?
>
> The code is lying. There is no RSL at that stage. I changed that error
> message.
>
> I also fixed the underlying problem:
> return l != null || !l.isEmpty(); -> return l != null && !l.isEmpty();
>
> Mihael
>
>>
>> swift version: r3266
>> cog version: r2739
>>
>> attachments: sites.xml, logfile
>>
>> session output:
>>
>> Caused by:
>>         Could not submit job
>> Caused by:
>>         Could not start coaster service
>> Caused by:
>>         Cannot parse the given RSL
>> Caused by:
>>         java.lang.NullPointerException
>>         at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.notEmpty(JobSubmissionTaskHandler.java:694)
>>         at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.prepareSpecification(JobSubmissionTaskHandler.java:328)
>>         at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:87)
>>         at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
>>         at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.startService(ServiceManager.java:179)
>>         at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:120)
>>         at org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.reserveService(ServiceManager.java:134)
>>         at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.getChannel(JobSubmissionTaskHandler.java:109)
>>         at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.getChannel(JobSubmissionTaskHandler.java:109)
>>         at org.globus.cog.abstraction.impl.execution.coaster.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:95)
>>         at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
>>         at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86)
>>         at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
>>         at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>         at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
>>         at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
>>         at java.lang.Thread.run(Thread.java:595)
>>
>> Cleaning up...
>> Done
>
>
>

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

From hategan at mcs.anl.gov Wed Apr 21 21:40:50 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 21 Apr 2010 21:40:50 -0500
Subject: [Swift-devel] coasters submitting improper gt2 rsl jobs
In-Reply-To:
References: <20100422015440.GA24449@communicado.ci.uchicago.edu> <1271902528.31945.22.camel@localhost>
Message-ID: <1271904050.559.1.camel@localhost>

On Wed, 2010-04-21 at 21:18 -0500, Allan Espinosa wrote:
> cool. so i just update my tree then?

Yep.

Though are you sure you want to use the trunk code for what you are doing?

From aespinosa at cs.uchicago.edu Wed Apr 21 21:48:33 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 21 Apr 2010 21:48:33 -0500
Subject: [Swift-devel] coaster worker.pl syntax errors for perl installations on RHEL4 variants
Message-ID:

Below is a small modification so that it runs on older Perl versions. I queried OSG RESS, and most of the sites are RHEL4 variants.

--- a/modules/provider-coaster/resources/worker.pl
+++ b/modules/provider-coaster/resources/worker.pl
@@ -134,7 +134,8 @@ sub hts {
 sub reconnect() {
 	my $fail = 0;
 	my $any;
-	my $i, $j;
+	my $i;
+	my $j;
 	for ($i = 0; $i < MAX_RECONNECT_ATTEMPTS; $i++) {
 		wlog INFO, "Connecting ($i)...\n";
 		my $sz = @HOSTNAME;

[aespinosa at communicado resources]$ perl --version

This is perl, v5.8.5 built for x86_64-linux-thread-multi

Copyright 1987-2004, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on this system using `man perl' or `perldoc perl'. If you have access to the Internet, point your browser at http://www.perl.com/, the Perl Home Page.

I don't know enough Perl kung-fu to fix this one, though :(

[aespinosa at communicado resources]$ perl -c worker.pl
Global symbol "$SCHEME" requires explicit package name at worker.pl line 189.
Global symbol "$HOSTNAME" requires explicit package name at worker.pl line 189.
Global symbol "$PORT" requires explicit package name at worker.pl line 189.
worker.pl had compilation errors.

--

From aespinosa at cs.uchicago.edu Wed Apr 21 21:49:33 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 21 Apr 2010 21:49:33 -0500
Subject: [Swift-devel] coasters submitting improper gt2 rsl jobs
In-Reply-To: <1271904050.559.1.camel@localhost>
References: <20100422015440.GA24449@communicado.ci.uchicago.edu> <1271902528.31945.22.camel@localhost> <1271904050.559.1.camel@localhost>
Message-ID:

I can't remember the last time I used builds from the stable branch :P

2010/4/21 Mihael Hategan :
> On Wed, 2010-04-21 at 21:18 -0500, Allan Espinosa wrote:
>> cool. so i just update my tree then?
>
> Yep.
>
> Though are you sure you want to use the trunk code for what you are
> doing?

From hategan at mcs.anl.gov Thu Apr 22 14:40:20 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 22 Apr 2010 14:40:20 -0500
Subject: [Swift-devel] coaster worker.pl syntax errors for perl installations on RHEL4 variants
In-Reply-To:
References:
Message-ID: <1271965220.13776.0.camel@localhost>

On Wed, 2010-04-21 at 21:48 -0500, Allan Espinosa wrote:
> Below's a small modification to run on lower perl versions. I queried
> OSG RESS and most of them are RHEL4 variants.

It's a bug. It's there on more recent versions of perl, too.

Should be fixed in svn.

From wilde at mcs.anl.gov Mon Apr 26 15:12:50 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 26 Apr 2010 15:12:50 -0500 (CDT)
Subject: [Swift-devel] Coaster error in RaptorLoops run
In-Reply-To: <4BD5EB22.1020009@mcs.anl.gov>
Message-ID: <17484956.647181272312770079.JavaMail.root@zimbra>

Wenjun, can you post more details on the problem you describe below, to the swift-devel list (cc'ed here) pointing Mihael to a directory with all your logs and config files?
Thanks, Mike ----- "wenjun wu" wrote: > Hi Mike, > Now I can run raptorloop locally but when I launch the jobs to > PADS > through coaster:ssh:pbs, I keep getting the following exceptions > after the swift finishes the most steps. > > 2010-04-26 13:27:25,408-0500 INFO AbstractStreamKarajanChannel > 01173289853: Channel shut down > java.lang.Throwable > at > org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.close(AbstractTCPChannel.java:97) > at > org.globus.cog.karajan.workflow.service.channels.MetaChannel.close(MetaChannel.java:87) > at > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.statusChanged(ServiceManager.java:232) > at > org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236) > at > org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224) > at > org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:253) > > at > org.globus.cog.abstraction.impl.ssh.execution.JobSubmissionTaskHandler.SSHTaskStatusChanged(JobSubmissionTaskHandler.java:193) > at > org.globus.cog.abstraction.impl.ssh.SSHRunner.notifyListeners(SSHRunner.java:84) > at > org.globus.cog.abstraction.impl.ssh.SSHRunner.run(SSHRunner.java:43) > > at java.lang.Thread.run(Thread.java:595) > 2010-04-26 13:27:25,408-0500 INFO ChannelManager Handling channel > exception > java.io.IOException: Stream closed. 
at > java.net.PlainSocketImpl.available(PlainSocketImpl.java:428) > at > java.net.SocketInputStream.available(SocketInputStream.java:217) > at > org.globus.gsi.gssapi.net.GssInputStream.available(GssInputStream.java:107) > > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:113) > at > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:365) > > Progress: Finished successfully:7 > Progress: Active:1 Finished successfully:7 > Progress: Active:1 Finished successfully:7 > Progress: Active:1 Finished successfully:7 > Progress: Checking status:1 Finished successfully:7 > Progress: Finished successfully:8 > Progress: Finished successfully:8 > Progress: Finished successfully:8 > Progress: Finished successfully:8 > Progress: Finished successfully:8 > Progress: Finished successfully:8 > Progress: Finished successfully:8 > > Wenjun > > was: Re: notes from todays meeting > > > > Hi Aashish, > > > > Wenjun and Tom are integrated the latest OOPS scripts into the > portal for Web execution. > > > > Wenjun is getting errors, as below. I suspect he's missing some > parameters or has incorrect parameters or inputs. > > > > Can you send to Wenjun the latest parameters (ie shell calling > examples) to run Loops, RaptorLoops, and RaptorLoops with prep stage? > > > > Best thing to do is quickly update README with the lastest shell > invocation lines and check it in; then Wenjun can verify that the > latest documented invocation instructions work for other people (which > will be useful for the OOPS group too!) > > > > I cant get to this till late today or early this weekend, so any > help you can offer will be great. > > > > Thanks! > > > > - Mike > > > > ----- "wenjun wu" wrote: > > > > > >> Hi Mike, > >> I run the raptorloop.sh and got the following error. Any clue? 
> >> Wenjun > >> > >> [wwj at login1 wwjtest]$ run.raptorloops.sh -target T1af7 -prepTar > >> T1af7.prep.tar.gz -templatesPerJob 800 > >> Running in > >> > /gpfs/pads/oops/scienceportal/oops-svn/oops/protlib2-0422/wwjtest/run.raptorloops.9229 > >> Running RaptorLoops with settings: target=T1af7 seqFile= > >> prepTar=T1af7.prep.tar.gz templatesPerJob=800 templateList= > nModels= > >> nSim=4 execsite=localhost maxSlots=16 resume= rlog= > >> Running from host with compute-node reachable address of > 172.5.86.5 > >> protlib2 home is > >> /gpfs/pads/oops/scienceportal/oops-svn/oops/protlib2-0422 > >> cp: warning: source file > >> > `/gpfs/pads/oops/scienceportal/oops-svn/oops/protlib2-0422/swift/RaptorOut.map' > >> > >> specified more than once > >> cp: warning: source file > >> > `/gpfs/pads/oops/scienceportal/oops-svn/oops/protlib2-0422/swift/TemplateList.map' > >> > >> specified more than once > >> cp: missing destination file operand after `.' > >> Try `cp --help' for more information. > >> basename: missing operand > >> Try `basename --help' for more information. 
> >> Variable nModels defined in scope 7122710 shadows variable of same > >> name > >> in scope 4890830 > >> Variable tseg defined in scope 26460367 shadows variable of same > name > >> in > >> scope 4890830 > >> Variable preparedInput defined in scope 26460367 shadows variable > of > >> same name in scope 4890830 > >> Variable nModels defined in scope 26460367 shadows variable of > same > >> name > >> in scope 4890830 > >> Variable targetId defined in scope 12182618 shadows variable of > same > >> name in scope 4890830 > >> Variable modelIn defined in scope 12182618 shadows variable of > same > >> name > >> in scope 4890830 > >> Variable targetId defined in scope 21925102 shadows variable of > same > >> name in scope 4890830 > >> Variable models defined in scope 21925102 shadows variable of same > >> name > >> in scope 4890830 > >> Swift svn swift-r3246 cog-r2721 > >> > >> RunID: 20100422-1609-aqv1y329 > >> Progress: > >> Execution failed: > >> java.lang.NumberFormatException: For input string: "" > >> > >> > >>> Wenjun, > >>> > >>> The first two we need are psim.loops.swift and RaptorLoops.swift, > >>> > >> and their corresponding runs scripts. > >> > >>> We run them from the corresponding .sh sripts in scripts/run > >>> > >>> I'll get back to you on this tonight with more details...after I > >>> > >> look for my 3rd script which is RaptorLoops with an addiitonal > >> pre-process step that takes a raw fasta file as input. I may need > to > >> check that in from my workspace. > >> > >>> - Mike > >>> > >>> > >>> ----- "wenjun wu" wrote: > >>> > >>> > >>> > >>>> Hi Mike: > >>>> I installed the latest version of protlib from SVN. I'd > like > >>>> > >> to > >> > >>>> clarify which swift scripts are needed into the portal. 
> >>>> > >>>> These are the swift scripts in the latest protlib2: > >>>> > >>>> rw-r--r-- 1 wwj ci-users 737 Apr 22 11:48 SwiftLib.swift > >>>> -rw-r--r-- 1 wwj ci-users 3237 Apr 22 11:48 psim.itfixex2.swift > >>>> -rw-r--r-- 1 wwj ci-users 2127 Apr 22 11:48 psim.itfixex1.swift > >>>> -rwxr-xr-x 1 wwj ci-users 509 Apr 22 11:48 psim.basicex1.swift > >>>> -rw-r--r-- 1 wwj ci-users 2616 Apr 22 11:48 BoostThreader.swift > >>>> -rw-r--r-- 1 wwj ci-users 1477 Apr 22 11:48 LoopLib.swift > >>>> -rw-r--r-- 1 wwj ci-users 1193 Apr 22 11:48 > >>>> > >> BoostThreaderLib.swift > >> > >>>> -rw-r--r-- 1 wwj ci-users 8869 Apr 22 11:48 oops.swift > >>>> -rw-r--r-- 1 wwj ci-users 1525 Apr 22 11:48 psim.sweepex1.swift > >>>> -rwxr-xr-x 1 wwj ci-users 2188 Apr 22 11:48 psim.swift > >>>> -rw-r--r-- 1 wwj ci-users 2933 Apr 22 11:48 psim.loops.swift > >>>> -rw-r--r-- 1 wwj ci-users 6820 Apr 22 11:48 > >>>> RaptorLoops.hanging.swift > >>>> -rw-r--r-- 1 wwj ci-users 2943 Apr 22 11:48 RaptorLoops.swift > >>>> > >>>> I guess the right swift scripts should be: psim.loops, > >>>> > >> BoostThreader > >> > >>>> and RaptorLoop. > >>>> I need to create packages for both Raptor-BoostThreader and > >>>> RaptorLoop > >>>> by grouping swift scripts and mapper scripts. > >>>> > >>>> > >>>> Wenjun > >>>> > >>>> > >>>>> DataPort 2010.0421 > >>>>> > >>>>> Coaster proxy issue: can Mihael automate this? > >>>>> > >>>>> Coaster proxy issue - use long proxy for now. > >>>>> > >>>>> Swift run status reporter? > >>>>> > >>>>> Adding new scripts and forms > >>>>> - how to shape the args? Like the email form? 
> >>>>> > >>>>> Need automation just for caps requests, then manual for Aashish > >>>>> > >>>>> > >>>> tests, then portal for Carl, Tobin et al > >>>> > >>>> > >>>>> Email notification > >>>>> > >>>>> Control over which swift the portal is running > >>>>> > >>>>> > >>>>> > >>>>> > >>> > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wwj at ci.uchicago.edu Mon Apr 26 16:01:06 2010 From: wwj at ci.uchicago.edu (Wenjun Wu) Date: Mon, 26 Apr 2010 16:01:06 -0500 Subject: [Swift-devel] Coaster error in RaptorLoops run In-Reply-To: <17484956.647181272312770079.JavaMail.root@zimbra> References: <17484956.647181272312770079.JavaMail.root@zimbra> Message-ID: <4BD5FF12.1000702@ci.uchicago.edu> Sure. My config files are under the folder: /gpfs/pads/oops/scienceportal/swift-svn/etc the logs can be found at /gpfs/pads/oops/scienceportal/scriptadmin/oops-raptorloop/test/RaptorLoops-20100426-1314-tna3q0a6.log Wenjun > Wenjun, can you post more details on the problem you describe below, to the swift-devel list (cc'ed here) pointing Mihael to a directory with all your logs and config files? > > Thanks, > > Mike > > ----- "wenjun wu" wrote: > > >> Hi Mike, >> Now I can run raptorloop locally but when I launch the jobs to >> PADS >> through coaster:ssh:pbs, I keep getting the following exceptions >> after the swift finishes the most steps. 
>> >> 2010-04-26 13:27:25,408-0500 INFO AbstractStreamKarajanChannel >> 01173289853: Channel shut down >> java.lang.Throwable >> at >> org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.close(AbstractTCPChannel.java:97) >> at >> org.globus.cog.karajan.workflow.service.channels.MetaChannel.close(MetaChannel.java:87) >> at >> org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.statusChanged(ServiceManager.java:232) >> at >> org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236) >> at >> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224) >> at >> org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:253) >> >> at >> org.globus.cog.abstraction.impl.ssh.execution.JobSubmissionTaskHandler.SSHTaskStatusChanged(JobSubmissionTaskHandler.java:193) >> at >> org.globus.cog.abstraction.impl.ssh.SSHRunner.notifyListeners(SSHRunner.java:84) >> at >> org.globus.cog.abstraction.impl.ssh.SSHRunner.run(SSHRunner.java:43) >> >> at java.lang.Thread.run(Thread.java:595) >> 2010-04-26 13:27:25,408-0500 INFO ChannelManager Handling channel >> exception >> java.io.IOException: Stream closed. 
at >> java.net.PlainSocketImpl.available(PlainSocketImpl.java:428) >> at >> java.net.SocketInputStream.available(SocketInputStream.java:217) >> at >> org.globus.gsi.gssapi.net.GssInputStream.available(GssInputStream.java:107) >> >> at >> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:113) >> at >> org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:365) >> >> Progress: Finished successfully:7 >> Progress: Active:1 Finished successfully:7 >> Progress: Active:1 Finished successfully:7 >> Progress: Active:1 Finished successfully:7 >> Progress: Checking status:1 Finished successfully:7 >> Progress: Finished successfully:8 >> Progress: Finished successfully:8 >> Progress: Finished successfully:8 >> Progress: Finished successfully:8 >> Progress: Finished successfully:8 >> Progress: Finished successfully:8 >> Progress: Finished successfully:8 >> >> Wenjun >> >>> was: Re: notes from todays meeting >>> >>> Hi Aashish, >>> >>> Wenjun and Tom are integrated the latest OOPS scripts into the >>> >> portal for Web execution. >> >>> Wenjun is getting errors, as below. I suspect he's missing some >>> >> parameters or has incorrect parameters or inputs. >> >>> Can you send to Wenjun the latest parameters (ie shell calling >>> >> examples) to run Loops, RaptorLoops, and RaptorLoops with prep stage? >> >>> Best thing to do is quickly update README with the lastest shell >>> >> invocation lines and check it in; then Wenjun can verify that the >> latest documented invocation instructions work for other people (which >> will be useful for the OOPS group too!) >> >>> I cant get to this till late today or early this weekend, so any >>> >> help you can offer will be great. >> >>> Thanks! >>> >>> - Mike >>> >>> ----- "wenjun wu" wrote: >>> >>> >>> >>>> Hi Mike, >>>> I run the raptorloop.sh and got the following error. Any clue? 
>>>> Wenjun >>>> >>>> [wwj at login1 wwjtest]$ run.raptorloops.sh -target T1af7 -prepTar >>>> T1af7.prep.tar.gz -templatesPerJob 800 >>>> Running in >>>> >>>> >> /gpfs/pads/oops/scienceportal/oops-svn/oops/protlib2-0422/wwjtest/run.raptorloops.9229 >> >>>> Running RaptorLoops with settings: target=T1af7 seqFile= >>>> prepTar=T1af7.prep.tar.gz templatesPerJob=800 templateList= >>>> >> nModels= >> >>>> nSim=4 execsite=localhost maxSlots=16 resume= rlog= >>>> Running from host with compute-node reachable address of >>>> >> 172.5.86.5 >> >>>> protlib2 home is >>>> /gpfs/pads/oops/scienceportal/oops-svn/oops/protlib2-0422 >>>> cp: warning: source file >>>> >>>> >> `/gpfs/pads/oops/scienceportal/oops-svn/oops/protlib2-0422/swift/RaptorOut.map' >> >>>> specified more than once >>>> cp: warning: source file >>>> >>>> >> `/gpfs/pads/oops/scienceportal/oops-svn/oops/protlib2-0422/swift/TemplateList.map' >> >>>> specified more than once >>>> cp: missing destination file operand after `.' >>>> Try `cp --help' for more information. >>>> basename: missing operand >>>> Try `basename --help' for more information. 
>>>> Variable nModels defined in scope 7122710 shadows variable of same >>>> name >>>> in scope 4890830 >>>> Variable tseg defined in scope 26460367 shadows variable of same >>>> >> name >> >>>> in >>>> scope 4890830 >>>> Variable preparedInput defined in scope 26460367 shadows variable >>>> >> of >> >>>> same name in scope 4890830 >>>> Variable nModels defined in scope 26460367 shadows variable of >>>> >> same >> >>>> name >>>> in scope 4890830 >>>> Variable targetId defined in scope 12182618 shadows variable of >>>> >> same >> >>>> name in scope 4890830 >>>> Variable modelIn defined in scope 12182618 shadows variable of >>>> >> same >> >>>> name >>>> in scope 4890830 >>>> Variable targetId defined in scope 21925102 shadows variable of >>>> >> same >> >>>> name in scope 4890830 >>>> Variable models defined in scope 21925102 shadows variable of same >>>> name >>>> in scope 4890830 >>>> Swift svn swift-r3246 cog-r2721 >>>> >>>> RunID: 20100422-1609-aqv1y329 >>>> Progress: >>>> Execution failed: >>>> java.lang.NumberFormatException: For input string: "" >>>> >>>> >>>> >>>>> Wenjun, >>>>> >>>>> The first two we need are psim.loops.swift and RaptorLoops.swift, >>>>> >>>>> >>>> and their corresponding runs scripts. >>>> >>>> >>>>> We run them from the corresponding .sh sripts in scripts/run >>>>> >>>>> I'll get back to you on this tonight with more details...after I >>>>> >>>>> >>>> look for my 3rd script which is RaptorLoops with an addiitonal >>>> pre-process step that takes a raw fasta file as input. I may need >>>> >> to >> >>>> check that in from my workspace. >>>> >>>> >>>>> - Mike >>>>> >>>>> >>>>> ----- "wenjun wu" wrote: >>>>> >>>>> >>>>> >>>>> >>>>>> Hi Mike: >>>>>> I installed the latest version of protlib from SVN. I'd >>>>>> >> like >> >>>>>> >>>>>> >>>> to >>>> >>>> >>>>>> clarify which swift scripts are needed into the portal. 
>>>>>> >>>>>> These are the swift scripts in the latest protlib2: >>>>>> >>>>>> rw-r--r-- 1 wwj ci-users 737 Apr 22 11:48 SwiftLib.swift >>>>>> -rw-r--r-- 1 wwj ci-users 3237 Apr 22 11:48 psim.itfixex2.swift >>>>>> -rw-r--r-- 1 wwj ci-users 2127 Apr 22 11:48 psim.itfixex1.swift >>>>>> -rwxr-xr-x 1 wwj ci-users 509 Apr 22 11:48 psim.basicex1.swift >>>>>> -rw-r--r-- 1 wwj ci-users 2616 Apr 22 11:48 BoostThreader.swift >>>>>> -rw-r--r-- 1 wwj ci-users 1477 Apr 22 11:48 LoopLib.swift >>>>>> -rw-r--r-- 1 wwj ci-users 1193 Apr 22 11:48 >>>>>> >>>>>> >>>> BoostThreaderLib.swift >>>> >>>> >>>>>> -rw-r--r-- 1 wwj ci-users 8869 Apr 22 11:48 oops.swift >>>>>> -rw-r--r-- 1 wwj ci-users 1525 Apr 22 11:48 psim.sweepex1.swift >>>>>> -rwxr-xr-x 1 wwj ci-users 2188 Apr 22 11:48 psim.swift >>>>>> -rw-r--r-- 1 wwj ci-users 2933 Apr 22 11:48 psim.loops.swift >>>>>> -rw-r--r-- 1 wwj ci-users 6820 Apr 22 11:48 >>>>>> RaptorLoops.hanging.swift >>>>>> -rw-r--r-- 1 wwj ci-users 2943 Apr 22 11:48 RaptorLoops.swift >>>>>> >>>>>> I guess the right swift scripts should be: psim.loops, >>>>>> >>>>>> >>>> BoostThreader >>>> >>>> >>>>>> and RaptorLoop. >>>>>> I need to create packages for both Raptor-BoostThreader and >>>>>> RaptorLoop >>>>>> by grouping swift scripts and mapper scripts. >>>>>> >>>>>> >>>>>> Wenjun >>>>>> >>>>>> >>>>>> >>>>>>> DataPort 2010.0421 >>>>>>> >>>>>>> Coaster proxy issue: can Mihael automate this? >>>>>>> >>>>>>> Coaster proxy issue - use long proxy for now. >>>>>>> >>>>>>> Swift run status reporter? >>>>>>> >>>>>>> Adding new scripts and forms >>>>>>> - how to shape the args? Like the email form? 
>>>>>>> >>>>>>> Need automation just for caps requests, then manual for Aashish >>>>>>> >>>>>>> >>>>>>> >>>>>> tests, then portal for Carl, Tobin et al >>>>>> >>>>>> >>>>>> >>>>>>> Email notification >>>>>>> >>>>>>> Control over which swift the portal is running >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>> >>>>> >>> > From aespinosa at cs.uchicago.edu Mon Apr 26 19:39:12 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Mon, 26 Apr 2010 19:39:12 -0500 Subject: [Swift-devel] build errors on the stable branch Message-ID: With a fresh checkout/ export in cog 4.1.7 branch and swift 1.0 branch, I get the following build errors: [javac] /autonfs/home/aespinosa/work/cogkit/modules/swift/src/org/globus/swift/catalog/util/ProfileParserException.java:23: warning: unmappable character for encoding UTF8 [javac] * @author Jens-S. V?ckler [javac] ^ [javac] /autonfs/home/aespinosa/work/cogkit/modules/swift/src/org/globus/swift/catalog/util/Separator.java:25: warning: unmappable character for encoding UTF8 [javac] * @author Jens-S. V?ckler [javac] ^ [javac] /autonfs/home/aespinosa/work/cogkit/modules/swift/src/org/griphyn/vdl/karajan/lib/SetFieldValue.java:51: cannot find symbol [javac] symbol : method getLast() [javac] location: class org.griphyn.vdl.mapping.Path [javac] markAsAvailable(stack, leaf.getParent(), leaf.getPathFromRoot().getLast()); [javac] ^ [javac] Note: /autonfs/home/aespinosa/work/cogkit/modules/swift/src/org/griphyn/vdl/karajan/VDL2ErrorTranslator.java uses or overrides a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] 1 error [javac] 5 warnings BUILD FAILED /autonfs/home/aespinosa/work/cogkit/modules/swift/build.xml:73: The following error occurred while executing this line: /autonfs/home/aespinosa/work/cogkit/mbuild.xml:464: The following error occurred while executing this line: /autonfs/home/aespinosa/work/cogkit/mbuild.xml:228: Compile failed; see the compiler error output for details. 
Total time: 24 seconds

This doesn't occur in a fresh checkout/export of cog-trunk and swift-trunk.

My build config:
$ java -version
java version "1.6.0_03"
Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
Java HotSpot(TM) Server VM (build 1.6.0_03-b05, mixed mode)

I tried building with java1.5 and got the same build errors as well.

-Allan

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

From aespinosa at cs.uchicago.edu Mon Apr 26 20:41:07 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 26 Apr 2010 20:41:07 -0500
Subject: [Swift-devel] coaster testing on OSG
Message-ID:

Using cog and swift trunk versions, attached are the log files from
running coasters on ff-grid.unl.edu using a gt2:gt2:fork endpoint.

Take note that I commented out some lines in worker.pl because the perl
interpreter in Firefly gives syntax error replies:

diff --git a/modules/provider-coaster/resources/worker.pl b/modules/provider-co
index 0e716fd..6d8e59f 100755
--- a/modules/provider-coaster/resources/worker.pl
+++ b/modules/provider-coaster/resources/worker.pl
@@ -166,7 +166,8 @@ sub reconnect() {
 		}
 	}
 	if ($any) {
-		die "Failed to connect: $!";
+		#die "Failed to connect: $!";
+		die "Failed to connect: ";
 	}
 	$LAST_HEARTBEAT = time();
 }
@@ -189,7 +190,7 @@ sub init() {
 	my $schemes = join(", ", @SCHEME);
 	my $hosts = join(", ", @HOSTNAME);
 	my $ports = join(", ", @PORT);
-	wlog DEBUG, "uri=$URI, scheme=$schemes, host=$hosts, port=$ports, block
+	#wlog DEBUG, "uri=$URI, scheme=$schemes, host=$hosts, port=$ports, blockid=$
 	reconnect();
 }

-Allan

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago
-------------- next part --------------
A non-text attachment was scrubbed...
Name: coaster_firefly.tar.gz
Type: application/x-gzip
Size: 17970 bytes
Desc: not available
URL:

From yizhu at cs.uchicago.edu Tue Apr 27 13:18:33 2010
From: yizhu at cs.uchicago.edu (Yi Zhu)
Date: Tue, 27 Apr 2010 13:18:33 -0500
Subject: [Swift-devel] Re: coaster on EC2 (error log)
In-Reply-To: <11666950.655621272328969454.JavaMail.root@zimbra>
References: <11666950.655621272328969454.JavaMail.root@zimbra>
Message-ID: <4BD72A79.6090501@cs.uchicago.edu>

Hi

I got a problem when I set the provider to coaster; I got the following error:

Progress:
Progress: Stage in:1
Progress: Submitted:1
Failed to transfer wrapper log from first-20100427-1308-o7o8r3c1/info/d on ec2
Execution failed:
Exception in echo:
Arguments: [Hello, world!]
Host: ec2
Directory: first-20100427-1308-o7o8r3c1/jobs/d/echo-dfktx5rj
stderr.txt:
stdout.txt:
----

Caused by:
Could not submit job
Caused by:
Could not start coaster service
Caused by:
Task ended before registration was received.
STDOUT:
STDERR:
Caused by:
Job failed with an exit code of 1
Cleaning up...
Done
-bash-3.2$

I also checked the coaster log on the server node; it shows it needs a binary file called gmd5sum. I searched Google and found that gmd5sum is a Windows-based jar file; maybe coaster needs md5sum instead of gmd5sum?
( md5sum is installed on host server by default) [torqueuser at ip-10-251-214-179 ~]$ cat coaster-bootstrap-11894108087.log using plain mode BS: http://tp-login2.ci.uchicago.edu:37470 which: no gmd5sum in (/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/globus/bin:/opt/vdt-1.10.1/globus/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/opt/pacman-3.26/bin:/usr/local/bin:/bin:/usr/bin) Expected checksum: acab90e149a0188fbc963803a42156c5 Computed checksum: acab90e149a0188fbc963803a42156c5 JAVA=/opt/vdt-1.10.1/jdk1.5/bin/java plain /opt/vdt-1.10.1/jdk1.5/bin/java -Djava=/opt/vdt-1.10.1/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= -DX509_USER_PROXY= -DX509_CERT_DIR=/opt/vdt-1.10.1/globus/TRUSTED_CA -DGLOBUS_HOSTNAME=ec2-204-236-204-71.compute-1.amazonaws.com -jar /tmp/bootstrap.Y19911 http://tp-login2.ci.uchicago.edu:37470 https://128.135.125.117:35183 11894108087 the sites.xml files I used: 1 1 5 5 1 10000 /home/torqueuser/swiftwork -Yi On 4/26/2010 7:42 PM, Michael Wilde wrote: > SOunds great, Yi - thanks for the update, and David for the assistance. > > Yes, we can meet tomorrow. Is 3:30 - 4:30 OK? > > In the meantime, can you send me the sites.xml file you used for coasters, and point me to the directory that contains stdout/err and the swift .log file for the failing run? (Please send this to swift-devel with a description of what you did. That way Mihael and others can help as well.) > > Thanks, > > Mike > > > ----- "Yi Zhu" wrote: > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Tue Apr 27 13:59:44 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 27 Apr 2010 13:59:44 -0500 Subject: [Swift-devel] Re: coaster on EC2 (error log) In-Reply-To: <4BD72A79.6090501@cs.uchicago.edu> References: <11666950.655621272328969454.JavaMail.root@zimbra> <4BD72A79.6090501@cs.uchicago.edu> Message-ID: <1272394784.8242.2.camel@localhost> On Tue, 2010-04-27 at 13:18 -0500, Yi Zhu wrote: > I also checked the coaster log in server node, it shows it need a > binary file called gmd5sum, It shows it looked for gmd5sum and didn't find it. So it tries md5sum instead. I believe gmd5sum is the default on OS X and the reason why it looks for it. In any event, if the coaster process doesn't fail saying "didn't find gmd5sum or md5sum", then that's likely not your problem, and judging from the log below which says "Computed checksum...", md5sum was found. > [...] > coaster-bootstrap-11894108087.log > using plain mode > BS: http://tp-login2.ci.uchicago.edu:37470 > which: no gmd5sum in > (/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/globus/bin:/opt/vdt-1.10.1/globus/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/opt/pacman-3.26/bin:/usr/local/bin:/bin:/usr/bin) > Expected checksum: acab90e149a0188fbc963803a42156c5 > Computed checksum: acab90e149a0188fbc963803a42156c5 [...] 
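Mihael's reply describes the bootstrap's integrity check as an ordered lookup plus a comparison: look for `gmd5sum` first (which he believes is the default name on OS X), fall back to `md5sum`, then compare the expected checksum of the downloaded bootstrap jar against the computed one. The Java sketch below mirrors that order under stated assumptions. The class and method names (`BootstrapChecksumSketch`, `findOnPath`, `md5Hex`, `verify`) are hypothetical. The real bootstrap is a shell script that runs whichever tool it finds; this sketch computes the digest in-process with `java.security.MessageDigest`, so it illustrates the logic rather than reproducing the coaster code.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BootstrapChecksumSketch {

    // Hypothetical helper: try each candidate tool in order (gmd5sum before md5sum)
    // and return the first one found as an executable file in a PATH directory, or null.
    static String findOnPath(String pathEnv, String... candidates) {
        for (String tool : candidates) {
            for (String dir : pathEnv.split(File.pathSeparator)) {
                if (dir.isEmpty()) {
                    continue; // skip empty PATH entries
                }
                File f = new File(dir, tool);
                if (f.isFile() && f.canExecute()) {
                    return f.getAbsolutePath();
                }
            }
        }
        return null;
    }

    // Lower-case hex MD5 digest, the same format md5sum prints.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Compare the expected checksum with the computed one (case-insensitive, trimmed).
    static boolean verify(byte[] data, String expected) {
        return md5Hex(data).equalsIgnoreCase(expected.trim());
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        String tool = findOnPath(path == null ? "" : path, "gmd5sum", "md5sum");
        System.out.println("checksum tool: " + (tool == null ? "none found" : tool));

        // Stand-in payload for the downloaded bootstrap jar; MD5("hello") is well known.
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        String expected = "5d41402abc4b2a76b9719d911017c592";
        System.out.println("Expected checksum: " + expected);
        System.out.println("Computed checksum: " + md5Hex(payload));
        System.out.println(verify(payload, expected) ? "checksum OK" : "checksum MISMATCH");
    }
}
```

In the EC2 bootstrap log above, the `which: no gmd5sum` line corresponds to the first candidate being absent. The matching Expected/Computed lines show that the fallback succeeded, which is why Mihael rules out the checksum step as the cause of the failure.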
From yizhu at cs.uchicago.edu Tue Apr 27 14:54:49 2010
From: yizhu at cs.uchicago.edu (Yi Zhu)
Date: Tue, 27 Apr 2010 14:54:49 -0500
Subject: [Swift-devel] Re: coaster on EC2 (error log)
In-Reply-To: <1272394784.8242.2.camel@localhost>
References: <11666950.655621272328969454.JavaMail.root@zimbra> <4BD72A79.6090501@cs.uchicago.edu> <1272394784.8242.2.camel@localhost>
Message-ID: <4BD74109.50009@cs.uchicago.edu>

Hi Mihael

Thanks! You are absolutely right.

In the meantime, I found there is an "EC number" line at the end of the coaster log, which I assume indicates an error code. I then tried the coaster with different job provider parameters and got the following results:

1) running without coaster --- gt2 + ssh + pbs (successful)
2) running without coaster --- gt2 + gridftp + pbs (successful)
3) coaster + gt2 + gridftp + pbs (failed)

sites.xml
1 1 5 2 1 10000 /home/torqueuser/swiftwork

screen output:
-bash-3.2$ swift -tc.file tc.test.data -sites.file sshpbscoast.xml first.swift
Swift svn swift-r3276 (swift modified locally) cog-r2739 (cog modified locally)

RunID: 20100427-1449-gfryv636
Progress:
Progress: Stage in:1
Progress: Submitted:1
Progress: Active:1
Failed to transfer wrapper log from first-20100427-1449-gfryv636/info/h on ec2
Progress: Failed:1
Execution failed:
Exception in echo:
Arguments: [Hello, world!]
Host: ec2
Directory: first-20100427-1449-gfryv636/jobs/h/echo-hveu16rj
stderr.txt:
stdout.txt:
----

Caused by:
Task failed: Error submitting block task
org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job
	at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:146)
	at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submit(JobSubmissionTaskHandler.java:100)
	at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46)
	at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50)
	at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:66)
Caused by: org.globus.gram.GramException: Data transfer to the server failed [Caused by: Authentication failed [Caused by: Failure unspecified at GSS-API level [Caused by: Unknown CA]]]
	at org.globus.gram.Gram.request(Gram.java:334)
	at org.globus.gram.GramJob.request(GramJob.java:262)
	at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.submitSingleJob(JobSubmissionTaskHandler.java:134)
	... 4 more

Cleaning up...
Shutting down service at https://10.251.214.179:50260
Got channel MetaChannel: 499668036 -> GSSSChannel-01006506816(1)
+ Done
-bash-3.2$

coaster log:
[...]
EC 13

Since I can submit jobs to gt2 via Swift successfully, there should not be any authentication issue with gt2 itself. But when I changed the provider to coaster, I got an [Unknown CA] error. Could this be caused by an authentication issue between the coaster server node and gt2?

-Yi Zhu

On 4/27/2010 1:59 PM, Mihael Hategan wrote:
> On Tue, 2010-04-27 at 13:18 -0500, Yi Zhu wrote:
>
>
>> I also checked the coaster log in server node, it shows it need a
>> binary file called gmd5sum,
>>
> It shows it looked for gmd5sum and didn't find it. So it tries md5sum
> instead.
I believe gmd5sum is the default on OS X and the reason why it > looks for it. In any event, if the coaster process doesn't fail saying > "didn't find gmd5sum or md5sum", then that's likely not your problem, > and judging from the log below which says "Computed checksum...", md5sum > was found. > > >> [...] >> > >> coaster-bootstrap-11894108087.log >> using plain mode >> BS: http://tp-login2.ci.uchicago.edu:37470 >> which: no gmd5sum in >> (/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/globus/bin:/opt/vdt-1.10.1/globus/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/opt/pacman-3.26/bin:/usr/local/bin:/bin:/usr/bin) >> Expected checksum: acab90e149a0188fbc963803a42156c5 >> Computed checksum: acab90e149a0188fbc963803a42156c5 >> > [...] > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Wed Apr 28 18:26:55 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 28 Apr 2010 18:26:55 -0500 (CDT) Subject: [Swift-devel] deep field bug In-Reply-To: <1269836262.21332.9.camel@localhost> Message-ID: <21102453.724491272497215588.JavaMail.root@zimbra> To close this issue down: Mihael provided the following second patch on top of the diff below: http://www.mcs.anl.gov/~hategan/deepfieldbug2.patch This has been working for me, and appears to completely solve this problem. Mihael, can you commit these two patches? Thanks, Mike ----- "Mihael Hategan" wrote: > Mike (and possibly others) have been experiencing a bit of a problem > with a certain type of script. > > The basic idea is an app returning a complex structure. 
The example > Mike > gave was (irrelevant stuff removed): > > type filestruct { > file filefield; > } > > app (filestruct fs[]) touchem () { touchem; } > > Where fs is mapped properly with a fixed array mapper. > > The problem with that is that closedataset (which is called when > touchem > is done) only closes fs and fs[*], but not fs[*].filefield. For some > reason the method used internally is closeChildren(). > > Now, it seems quite obvious that that's bogus and a deep close should > be done instead when returning from an app. Furthermore, I see no > need > for closeChildren() to ever be called on its own (other than perhaps > by > a deep close). So I'm sending this email to see if anybody (Ben that > is) > is of a different opinion. > > In the mean-time, for those who want to test a possible solution (and > please run the whole test suite in the process): > http://www.mcs.anl.gov/~hategan/deepfieldbug.diff > > (you need to apply it in swift/src/org/griphyn/vdl). > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Apr 28 18:32:25 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 28 Apr 2010 18:32:25 -0500 (CDT) Subject: [Swift-devel] Coaster error in RaptorLoops run In-Reply-To: <4BD5FF12.1000702@ci.uchicago.edu> Message-ID: <1377152.724561272497545070.JavaMail.root@zimbra> Looking at these logs closely with Wenjun, it seems that his run stumbled into the deepfield bug, for which the 2 patches mentioned in my prior message are needed. So for now, this problem can be ignored. Wenjun is retesting with a patched stable swift branch. - Mike ----- "Wenjun Wu" wrote: > Sure. 
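The difference Mihael describes in the deep field bug thread, between closeChildren() and a deep close, can be illustrated with a toy tree. This is a minimal sketch under stated assumptions, not Swift's AbstractDataNode: the class DataNode and its methods closeChildren(), closeDeep() and countOpen() are hypothetical names chosen for the example. It models the filestruct case, where closing fs and fs[*] leaves every fs[*].filefield open, while a recursive close reaches the leaves.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a mapped data node (not the real AbstractDataNode). */
public class DataNode {
    private final String name;
    private final List<DataNode> children = new ArrayList<>();
    private boolean closed = false;

    public DataNode(String name) {
        this.name = name;
    }

    /** Adds a child whose name is this node's name plus the given suffix. */
    public DataNode addChild(String suffix) {
        DataNode child = new DataNode(name + suffix);
        children.add(child);
        return child;
    }

    /** Shallow close: this node and its direct children only, nothing deeper. */
    public void closeChildren() {
        closed = true;
        for (DataNode child : children) {
            child.closed = true;
        }
    }

    /** Deep close: this node and every descendant. */
    public void closeDeep() {
        closed = true;
        for (DataNode child : children) {
            child.closeDeep();
        }
    }

    /** Counts the nodes in this subtree that are still open. */
    public int countOpen() {
        int open = closed ? 0 : 1;
        for (DataNode child : children) {
            open += child.countOpen();
        }
        return open;
    }

    /** Models "filestruct fs[]" with size elements, each holding one "filefield". */
    public static DataNode buildExample(int size) {
        DataNode fs = new DataNode("fs");
        for (int i = 0; i < size; i++) {
            fs.addChild("[" + i + "]").addChild(".filefield");
        }
        return fs;
    }

    public static void main(String[] args) {
        DataNode shallow = buildExample(3);
        shallow.closeChildren();
        System.out.println("open after closeChildren: " + shallow.countOpen()); // 3 (the filefields)

        DataNode deep = buildExample(3);
        deep.closeDeep();
        System.out.println("open after closeDeep: " + deep.countOpen()); // 0
    }
}
```

In this model anything waiting on fs[i].filefield would wait forever after a shallow close, which matches the symptom reported in the thread.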
My config files are under the folder: > /gpfs/pads/oops/scienceportal/swift-svn/etc > > the logs can be found at > /gpfs/pads/oops/scienceportal/scriptadmin/oops-raptorloop/test/RaptorLoops-20100426-1314-tna3q0a6.log > > Wenjun > > Wenjun, can you post more details on the problem you describe below, > to the swift-devel list (cc'ed here) pointing Mihael to a directory > with all your logs and config files? > > > > Thanks, > > > > Mike > > > > ----- "wenjun wu" wrote: > > > > > >> Hi Mike, > >> Now I can run raptorloop locally but when I launch the jobs > to > >> PADS > >> through coaster:ssh:pbs, I keep getting the following exceptions > >> after the swift finishes the most steps. > >> > >> 2010-04-26 13:27:25,408-0500 INFO AbstractStreamKarajanChannel > >> 01173289853: Channel shut down > >> java.lang.Throwable > >> at > >> > org.globus.cog.karajan.workflow.service.channels.AbstractTCPChannel.close(AbstractTCPChannel.java:97) > >> at > >> > org.globus.cog.karajan.workflow.service.channels.MetaChannel.close(MetaChannel.java:87) > >> at > >> > org.globus.cog.abstraction.impl.execution.coaster.ServiceManager.statusChanged(ServiceManager.java:232) > >> at > >> > org.globus.cog.abstraction.impl.common.task.TaskImpl.notifyListeners(TaskImpl.java:236) > >> at > >> > org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:224) > >> at > >> > org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:253) > >> > >> at > >> > org.globus.cog.abstraction.impl.ssh.execution.JobSubmissionTaskHandler.SSHTaskStatusChanged(JobSubmissionTaskHandler.java:193) > >> at > >> > org.globus.cog.abstraction.impl.ssh.SSHRunner.notifyListeners(SSHRunner.java:84) > >> at > >> > org.globus.cog.abstraction.impl.ssh.SSHRunner.run(SSHRunner.java:43) > >> > >> at java.lang.Thread.run(Thread.java:595) > >> 2010-04-26 13:27:25,408-0500 INFO ChannelManager Handling channel > >> exception > >> java.io.IOException: Stream closed. 
at > >> java.net.PlainSocketImpl.available(PlainSocketImpl.java:428) > >> at > >> java.net.SocketInputStream.available(SocketInputStream.java:217) > >> at > >> > org.globus.gsi.gssapi.net.GssInputStream.available(GssInputStream.java:107) > >> > >> at > >> > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel.step(AbstractStreamKarajanChannel.java:113) > >> at > >> > org.globus.cog.karajan.workflow.service.channels.AbstractStreamKarajanChannel$Multiplexer.run(AbstractStreamKarajanChannel.java:365) > >> > >> Progress: Finished successfully:7 > >> Progress: Active:1 Finished successfully:7 > >> Progress: Active:1 Finished successfully:7 > >> Progress: Active:1 Finished successfully:7 > >> Progress: Checking status:1 Finished successfully:7 > >> Progress: Finished successfully:8 > >> Progress: Finished successfully:8 > >> Progress: Finished successfully:8 > >> Progress: Finished successfully:8 > >> Progress: Finished successfully:8 > >> Progress: Finished successfully:8 > >> Progress: Finished successfully:8 > >> > >> Wenjun > >> > >>> was: Re: notes from today's meeting > >>> > >>> Hi Aashish, > >>> > >>> Wenjun and Tom are integrating the latest OOPS scripts into the > >>> > >> portal for Web execution. > >> > >>> Wenjun is getting errors, as below. I suspect he's missing some > >>> > >> parameters or has incorrect parameters or inputs. > >> > >>> Can you send to Wenjun the latest parameters (ie shell calling > >>> > >> examples) to run Loops, RaptorLoops, and RaptorLoops with prep > stage? > >> > >>> Best thing to do is quickly update README with the latest shell > >>> > >> invocation lines and check it in; then Wenjun can verify that the > >> latest documented invocation instructions work for other people > (which > >> will be useful for the OOPS group too!) > >> > >>> I can't get to this till late today or early this weekend, so any > >>> > >> help you can offer will be great. > >> > >>> Thanks!
> >>> > >>> - Mike > >>> > >>> ----- "wenjun wu" wrote: > >>> > >>> > >>> > >>>> Hi Mike, > >>>> I run the raptorloop.sh and got the following error. Any > clue? > >>>> Wenjun > >>>> > >>>> [wwj at login1 wwjtest]$ run.raptorloops.sh -target T1af7 -prepTar > >>>> T1af7.prep.tar.gz -templatesPerJob 800 > >>>> Running in > >>>> > >>>> > >> > /gpfs/pads/oops/scienceportal/oops-svn/oops/protlib2-0422/wwjtest/run.raptorloops.9229 > >> > >>>> Running RaptorLoops with settings: target=T1af7 seqFile= > >>>> prepTar=T1af7.prep.tar.gz templatesPerJob=800 templateList= > >>>> > >> nModels= > >> > >>>> nSim=4 execsite=localhost maxSlots=16 resume= rlog= > >>>> Running from host with compute-node reachable address of > >>>> > >> 172.5.86.5 > >> > >>>> protlib2 home is > >>>> /gpfs/pads/oops/scienceportal/oops-svn/oops/protlib2-0422 > >>>> cp: warning: source file > >>>> > >>>> > >> > `/gpfs/pads/oops/scienceportal/oops-svn/oops/protlib2-0422/swift/RaptorOut.map' > >> > >>>> specified more than once > >>>> cp: warning: source file > >>>> > >>>> > >> > `/gpfs/pads/oops/scienceportal/oops-svn/oops/protlib2-0422/swift/TemplateList.map' > >> > >>>> specified more than once > >>>> cp: missing destination file operand after `.' > >>>> Try `cp --help' for more information. > >>>> basename: missing operand > >>>> Try `basename --help' for more information. 
> >>>> Variable nModels defined in scope 7122710 shadows variable of > same > >>>> name > >>>> in scope 4890830 > >>>> Variable tseg defined in scope 26460367 shadows variable of same > >>>> > >> name > >> > >>>> in > >>>> scope 4890830 > >>>> Variable preparedInput defined in scope 26460367 shadows > variable > >>>> > >> of > >> > >>>> same name in scope 4890830 > >>>> Variable nModels defined in scope 26460367 shadows variable of > >>>> > >> same > >> > >>>> name > >>>> in scope 4890830 > >>>> Variable targetId defined in scope 12182618 shadows variable of > >>>> > >> same > >> > >>>> name in scope 4890830 > >>>> Variable modelIn defined in scope 12182618 shadows variable of > >>>> > >> same > >> > >>>> name > >>>> in scope 4890830 > >>>> Variable targetId defined in scope 21925102 shadows variable of > >>>> > >> same > >> > >>>> name in scope 4890830 > >>>> Variable models defined in scope 21925102 shadows variable of > same > >>>> name > >>>> in scope 4890830 > >>>> Swift svn swift-r3246 cog-r2721 > >>>> > >>>> RunID: 20100422-1609-aqv1y329 > >>>> Progress: > >>>> Execution failed: > >>>> java.lang.NumberFormatException: For input string: "" > >>>> > >>>> > >>>> > >>>>> Wenjun, > >>>>> > >>>>> The first two we need are psim.loops.swift and > RaptorLoops.swift, > >>>>> > >>>>> > >>>> and their corresponding run scripts. > >>>> > >>>> > >>>>> We run them from the corresponding .sh scripts in scripts/run > >>>>> > >>>>> I'll get back to you on this tonight with more details...after > I > >>>>> > >>>>> > >>>> look for my 3rd script which is RaptorLoops with an additional > >>>> pre-process step that takes a raw fasta file as input. I may > need > >>>> > >> to > >> > >>>> check that in from my workspace. > >>>> > >>>> > >>>>> - Mike > >>>>> > >>>>> > >>>>> ----- "wenjun wu" wrote: > >>>>> > >>>>> > >>>>> > >>>>> > >>>>>> Hi Mike: > >>>>>> I installed the latest version of protlib from SVN.
I'd > >>>>>> > >> like > >> > >>>>>> > >>>>>> > >>>> to > >>>> > >>>> > >>>>>> clarify which swift scripts are needed into the portal. > >>>>>> > >>>>>> These are the swift scripts in the latest protlib2: > >>>>>> > >>>>>> rw-r--r-- 1 wwj ci-users 737 Apr 22 11:48 > SwiftLib.swift > >>>>>> -rw-r--r-- 1 wwj ci-users 3237 Apr 22 11:48 > psim.itfixex2.swift > >>>>>> -rw-r--r-- 1 wwj ci-users 2127 Apr 22 11:48 > psim.itfixex1.swift > >>>>>> -rwxr-xr-x 1 wwj ci-users 509 Apr 22 11:48 > psim.basicex1.swift > >>>>>> -rw-r--r-- 1 wwj ci-users 2616 Apr 22 11:48 > BoostThreader.swift > >>>>>> -rw-r--r-- 1 wwj ci-users 1477 Apr 22 11:48 LoopLib.swift > >>>>>> -rw-r--r-- 1 wwj ci-users 1193 Apr 22 11:48 > >>>>>> > >>>>>> > >>>> BoostThreaderLib.swift > >>>> > >>>> > >>>>>> -rw-r--r-- 1 wwj ci-users 8869 Apr 22 11:48 oops.swift > >>>>>> -rw-r--r-- 1 wwj ci-users 1525 Apr 22 11:48 > psim.sweepex1.swift > >>>>>> -rwxr-xr-x 1 wwj ci-users 2188 Apr 22 11:48 psim.swift > >>>>>> -rw-r--r-- 1 wwj ci-users 2933 Apr 22 11:48 psim.loops.swift > >>>>>> -rw-r--r-- 1 wwj ci-users 6820 Apr 22 11:48 > >>>>>> RaptorLoops.hanging.swift > >>>>>> -rw-r--r-- 1 wwj ci-users 2943 Apr 22 11:48 RaptorLoops.swift > >>>>>> > >>>>>> I guess the right swift scripts should be: psim.loops, > >>>>>> > >>>>>> > >>>> BoostThreader > >>>> > >>>> > >>>>>> and RaptorLoop. > >>>>>> I need to create packages for both Raptor-BoostThreader > and > >>>>>> RaptorLoop > >>>>>> by grouping swift scripts and mapper scripts. > >>>>>> > >>>>>> > >>>>>> Wenjun > >>>>>> > >>>>>> > >>>>>> > >>>>>>> DataPort 2010.0421 > >>>>>>> > >>>>>>> Coaster proxy issue: can Mihael automate this? > >>>>>>> > >>>>>>> Coaster proxy issue - use long proxy for now. > >>>>>>> > >>>>>>> Swift run status reporter? > >>>>>>> > >>>>>>> Adding new scripts and forms > >>>>>>> - how to shape the args? Like the email form? 
> >>>>>>> > >>>>>>> Need automation just for caps requests, then manual for > Aashish > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> tests, then portal for Carl, Tobin et al > >>>>>> > >>>>>> > >>>>>> > >>>>>>> Email notification > >>>>>>> > >>>>>>> Control over which swift the portal is running > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>> > >>>>> > >>> > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Wed Apr 28 23:54:51 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 28 Apr 2010 23:54:51 -0500 (CDT) Subject: [Swift-devel] Problem with incorrect host cert DN in coaster GSI authentication Message-ID: <20240563.728271272516891856.JavaMail.root@zimbra> Mihael, Can you post an update on Yi's problem in getting coasters running over Nimbus/AWS? Easy to fix or hard? Should he try SSH for the coaster launch? (jobmanager=ssh:pbs ???) Thanks, Mike From hategan at mcs.anl.gov Thu Apr 29 10:48:11 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 29 Apr 2010 10:48:11 -0500 Subject: [Swift-devel] Re: Problem with incorrect host cert DN in coaster GSI authentication In-Reply-To: <20240563.728271272516891856.JavaMail.root@zimbra> References: <20240563.728271272516891856.JavaMail.root@zimbra> Message-ID: <1272556091.17222.1.camel@localhost> The host cert isn't incorrect. It's GSI with its silly reverse lookup that causes things to fail. gt2:pbs should work (assuming the pbs provider does). On Wed, 2010-04-28 at 23:54 -0500, Michael Wilde wrote: > Mihael, > > Can you post an update on Yi's problem in getting coasters running over Nimbus/AWS? > Easy to fix or hard? > > Should he try SSH for the coaster launch? (jobmanager=ssh:pbs ???) 
> > Thanks, > > Mike From wilde at mcs.anl.gov Thu Apr 29 10:57:01 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 29 Apr 2010 10:57:01 -0500 (CDT) Subject: [Swift-devel] Re: Problem with incorrect host cert DN in coaster GSI authentication In-Reply-To: <1272556091.17222.1.camel@localhost> Message-ID: <14586782.739691272556621922.JavaMail.root@zimbra> OK, thanks. Its not clear to me exactly whats happening, but I get the high-level idea that it relates to trust relationships that get broken because of differences in DN settings and/or interpretations. Yi, can you try gt2:pbs? Mihael, at some point can you post a note explaining the issues? I think we need to document or automate/fix the various interactions between coasters and GSI: - this new issue/restriction with gt2:gt2:pbs - the GSI needs and user config procedures for ssh:pbs Thanks, Mike ----- "Mihael Hategan" wrote: > The host cert isn't incorrect. It's GSI with its silly reverse lookup > that causes things to fail. > > gt2:pbs should work (assuming the pbs provider does). > > On Wed, 2010-04-28 at 23:54 -0500, Michael Wilde wrote: > > Mihael, > > > > Can you post an update on Yi's problem in getting coasters running > over Nimbus/AWS? > > Easy to fix or hard? > > > > Should he try SSH for the coaster launch? (jobmanager=ssh:pbs ???) > > > > Thanks, > > > > Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Apr 29 11:18:08 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 29 Apr 2010 11:18:08 -0500 Subject: [Swift-devel] Re: Problem with incorrect host cert DN in coaster GSI authentication In-Reply-To: <14586782.739691272556621922.JavaMail.root@zimbra> References: <14586782.739691272556621922.JavaMail.root@zimbra> Message-ID: <1272557888.17680.15.camel@localhost> On Thu, 2010-04-29 at 10:57 -0500, Michael Wilde wrote: > OK, thanks. 
It's not clear to me exactly what's happening, but I get the high-level idea that it relates to trust relationships that get broken because of differences in DN settings and/or interpretations. No. It's something that someone, while writing up GSI, thought was going to make things easier. Well, it doesn't, and it makes things insecure. But once in, it never changed. Normally, when you connect to bankofamerica.com, the browser resolves that name to an IP, contacts that IP, gets a certificate and checks the DN against the name you typed. In GSI, when you connect to bankofamerica.com, the browser resolves that name to an IP, contacts that IP, gets a certificate, does a reverse-resolution on that IP and then checks the DN of the cert against the reverse-resolved name of the IP. That reverse-resolved name may not be bankofamerica.com. This was done to provide easy (for the sysadmin) ways of having multiple DNS entries be used with the same machine. The problem is that it also fails for some scenarios (like the one we have). Not only that, it is an abomination in terms of security since impersonating a service can now be done with DNS hacks instead of the more difficult schemes involving cracking RSA/DSA. > > Yi, can you try gt2:pbs? > > Mihael, at some point can you post a note explaining the issues? > > I think we need to document or automate/fix the various interactions between coasters and GSI: > > - this new issue/restriction with gt2:gt2:pbs > - the GSI needs and user config procedures for ssh:pbs > > Thanks, > > Mike > > ----- "Mihael Hategan" wrote: > > > The host cert isn't incorrect. It's GSI with its silly reverse lookup > > that causes things to fail. > > > > gt2:pbs should work (assuming the pbs provider does). > > > > On Wed, 2010-04-28 at 23:54 -0500, Michael Wilde wrote: > > > Mihael, > > > > > > Can you post an update on Yi's problem in getting coasters running > > over Nimbus/AWS? > > > Easy to fix or hard?
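The two host-name checks Mihael contrasts can be made concrete with a small sketch. It is a hypothetical model, not GSI or CoG code: in-memory maps stand in for forward (A) and reverse (PTR) DNS so it runs offline, and the class HostCheck, the host names and the IP addresses are all illustrative assumptions. Real GSI/TLS libraries perform these checks internally.

```java
import java.util.Map;

/** Hypothetical model of two host-name checks; maps stand in for DNS. */
public class HostCheck {
    private final Map<String, String> forward; // host name -> IP (A record)
    private final Map<String, String> reverse; // IP -> host name (PTR record)

    public HostCheck(Map<String, String> forward, Map<String, String> reverse) {
        this.forward = forward;
        this.reverse = reverse;
    }

    /** Usual check: the certificate's host must equal the name the client asked for. */
    public boolean standardCheck(String requestedName, String certHost) {
        return requestedName.equalsIgnoreCase(certHost);
    }

    /** Reverse-lookup check as described: compare the cert with the PTR name of the IP. */
    public boolean reverseLookupCheck(String requestedName, String certHost) {
        String ip = forward.get(requestedName);
        if (ip == null) {
            return false; // the requested name does not resolve at all
        }
        String ptrName = reverse.get(ip);
        // requestedName plays no part in this comparison, only DNS does.
        return ptrName != null && ptrName.equalsIgnoreCase(certHost);
    }

    public static void main(String[] args) {
        // Case 1: a public name whose IP reverse-resolves to an internal name.
        HostCheck dns = new HostCheck(
                Map.of("head.example.org", "192.0.2.10"),
                Map.of("192.0.2.10", "ip-192-0-2-10.internal"));
        System.out.println(dns.standardCheck("head.example.org", "head.example.org"));      // true
        System.out.println(dns.reverseLookupCheck("head.example.org", "head.example.org")); // false

        // Case 2: the name is hijacked to an attacker's IP whose PTR matches the attacker's own cert.
        HostCheck poisoned = new HostCheck(
                Map.of("bank.example.com", "198.51.100.7"),
                Map.of("198.51.100.7", "attacker.example.net"));
        System.out.println(poisoned.standardCheck("bank.example.com", "attacker.example.net"));      // false
        System.out.println(poisoned.reverseLookupCheck("bank.example.com", "attacker.example.net")); // true
    }
}
```

Case 1 is the failure mode in the Nimbus/AWS setup (a valid certificate rejected). Case 2 is the security objection: once the name is redirected, a certificate the attacker legitimately holds for a different name is accepted.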
> > > > > > Should he try SSH for the coaster launch? (jobmanager=ssh:pbs ???) > > > > > > Thanks, > > > > > > Mike > From wilde at mcs.anl.gov Thu Apr 29 16:36:08 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 29 Apr 2010 16:36:08 -0500 (CDT) Subject: [Swift-devel] Fwd: Swift run failed in portal In-Reply-To: <4BD9F99D.9010002@mcs.anl.gov> Message-ID: <15373846.753931272576968860.JavaMail.root@zimbra> Mihael, just a note that Wenjun could use these patches committed. Note that I did *not* have time to try the language tests, which ideally should be done before committing. - Mike ----- Forwarded Message ----- From: "wenjun wu" To: "Michael Wilde" Sent: Thursday, April 29, 2010 4:26:53 PM GMT -06:00 US/Canada Central Subject: Re: Swift run failed in portal Hi Mike, You version of swift works fine with RaptorLoops. But I got some compilation issue after doing the two patches. Wenjun [javac] Compiling 345 source files to /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/build [javac] /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/Catalog.java:24: warning: unmappable character for encoding UTF8 [javac] * @author Jens-S. V?ckler [javac] ^ [javac] /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/CatalogEntry.java:21: warning: unmappable character for encoding UTF8 [javac] * @author Jens-S. V?ckler [javac] ^ [javac] /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/util/Escape.java:36: warning: unmappable character for encoding UTF8 [javac] * @author Jens-S. V?ckler [javac] ^ [javac] /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/util/ProfileParserException.java:23: warning: unmappable character for encoding UTF8 [javac] * @author Jens-S. 
V?ckler [javac] ^ [javac] /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/util/Separator.java:25: warning: unmappable character for encoding UTF8 [javac] * @author Jens-S. V?ckler [javac] ^ [javac] /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/griphyn/vdl/mapping/AbstractDataNode.java:543: closeDeepStructure() is already defined in org.griphyn.vdl.mapping.AbstractDataNode [javac] public void closeDeepStructure() { [javac] ^ [javac] /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/griphyn/vdl/karajan/lib/SetFieldValue.java:51: cannot find symbol [javac] symbol : method getLast() [javac] location: class org.griphyn.vdl.mapping.Path [javac] markAsAvailable(stack, leaf.getParent(), leaf.getPathFromRoot().getLast()); [javac] ^ [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] 2 errors [javac] 5 warnings > Wenjun, did the patched Swift release solve this problem? > > Thanks, > > Mike > > ----- "wenjun wu" wrote: > > >> It is located at >> /gpfs/pads/oops/scienceportal/apache-tomcat/webapps/SIDGridPortal/workflow/20100421/oops-20100421-1638-5up59v27 >> >> Wenjun >> >>> Wenjun, Tom, >>> >>> The run I started during our meeting failed: >>> >>> Apr 21 16:38 - 5up59v27 Failed >>> >>> How do I tell why? >>> >>> Is there at least a server-side run directory that the user could >>> >> navigate to within the CI filespace? >> >>> Thanks, >>> >>> Mike >>> >>> > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Thu Apr 29 16:37:58 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 29 Apr 2010 16:37:58 -0500 (CDT) Subject: [Swift-devel] How long do idle coaster workers linger? 
Message-ID: <20453294.754031272577078702.JavaMail.root@zimbra> How long do coaster workers (or blocks?) idle before they are shut down? -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From aespinosa at cs.uchicago.edu Thu Apr 29 16:52:17 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 29 Apr 2010 16:52:17 -0500 Subject: [Swift-devel] Fwd: Swift run failed in portal In-Reply-To: <15373846.753931272576968860.JavaMail.root@zimbra> References: <4BD9F99D.9010002@mcs.anl.gov> <15373846.753931272576968860.JavaMail.root@zimbra> Message-ID: I got these errors from the stable branch as well. I think I posted this one in another thread. -Allan 2010/4/29 Michael Wilde : > Mihael, just a note that Wenjun could use these patches committed. > Note that I did *not* have time to try the language tests, which ideally should be done before committing. > > - Mike > > ----- Forwarded Message ----- > From: "wenjun wu" > To: "Michael Wilde" > Sent: Thursday, April 29, 2010 4:26:53 PM GMT -06:00 US/Canada Central > Subject: Re: Swift run failed in portal > > Hi Mike, > You version of swift works fine with RaptorLoops. But I got some > compilation issue after doing the two patches. > > Wenjun > > [javac] Compiling 345 source files to > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/build > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/Catalog.java:24: > warning: unmappable character for encoding UTF8 > [javac] * @author Jens-S. V?ckler > [javac] ^ > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/CatalogEntry.java:21: > warning: unmappable character for encoding UTF8 > [javac] * @author Jens-S. V?ckler > [javac] ^ >
[javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/util/Escape.java:36: > warning: unmappable character for encoding UTF8 > [javac] * @author Jens-S. V?ckler > [javac] ^ > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/util/ProfileParserException.java:23: > warning: unmappable character for encoding UTF8 > [javac] * @author Jens-S. V?ckler > [javac] ^ > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/util/Separator.java:25: > warning: unmappable character for encoding UTF8 > [javac] * @author Jens-S. V?ckler > [javac] ^ > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/griphyn/vdl/mapping/AbstractDataNode.java:543: > closeDeepStructure() is already defined in > org.griphyn.vdl.mapping.AbstractDataNode > [javac] public void closeDeepStructure() { > [javac] ^ > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/griphyn/vdl/karajan/lib/SetFieldValue.java:51: > cannot find symbol > [javac] symbol : method getLast() > [javac] location: class org.griphyn.vdl.mapping.Path > [javac] markAsAvailable(stack, > leaf.getParent(), leaf.getPathFromRoot().getLast()); > > [javac] > ^ > [javac] Note: Some input files use or override a deprecated API. > [javac] Note: Recompile with -Xlint:deprecation for details. > [javac] 2 errors > [javac] 5 warnings > >> Wenjun, did the patched Swift release solve this problem?
>> >> Thanks, >> >> Mike >> >> ----- "wenjun wu" wrote: >> >> >>> It is located at >>> /gpfs/pads/oops/scienceportal/apache-tomcat/webapps/SIDGridPortal/workflow/20100421/oops-20100421-1638-5up59v27 >>> >>> Wenjun >>> >>>> Wenjun, Tom, >>>> >>>> The run I started during our meeting failed: >>>> >>>> Apr 21 16:38 - 5up59v27 Failed >>>> >>>> How do I tell why? >>>> >>>> Is there at least a server-side run directory that the user could >>>> >>> navigate to within the CI filespace? >>> >>>> Thanks, >>>> >>>> Mike >>>> > From aespinosa at cs.uchicago.edu Thu Apr 29 16:53:32 2010 From: aespinosa at cs.uchicago.edu (Allan Espinosa) Date: Thu, 29 Apr 2010 16:53:32 -0500 Subject: [Swift-devel] How long do idle coaster workers linger? In-Reply-To: <20453294.754031272577078702.JavaMail.root@zimbra> References: <20453294.754031272577078702.JavaMail.root@zimbra> Message-ID: I thought these were decided by maxtime, then the client/submit host will send a shutdown command to the bootstrap service to shut down everything. -Allan 2010/4/29 Michael Wilde : > How long do coaster workers (or blocks?) idle before they are shut down? > > From wilde at mcs.anl.gov Thu Apr 29 17:05:52 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 29 Apr 2010 17:05:52 -0500 (CDT) Subject: [Swift-devel] How long do idle coaster workers linger? In-Reply-To: Message-ID: <9728601.755011272578752244.JavaMail.root@zimbra> Maxtime primarily limits the walltime request to the local scheduler for the coaster block. But in addition, I think the workers are shut down before that maxtime expires if there is no more work for them. That's the aspect I was asking about. - Mike ----- "Allan Espinosa" wrote: > I thought these were decided by maxtime, then the client/submit host > will send a shutdown command to the bootstrap service to shut down > everything. > > -Allan > > 2010/4/29 Michael Wilde : > > How long do coaster workers (or blocks?) idle before they are shut > down?
> > > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From yizhu at cs.uchicago.edu Thu Apr 29 17:34:44 2010 From: yizhu at cs.uchicago.edu (Yi Zhu) Date: Thu, 29 Apr 2010 17:34:44 -0500 Subject: [Swift-devel] Re: Problem with incorrect host cert DN in coaster GSI authentication In-Reply-To: <1272557888.17680.15.camel@localhost> References: <14586782.739691272556621922.JavaMail.root@zimbra> <1272557888.17680.15.camel@localhost> Message-ID: <4BDA0984.4070604@cs.uchicago.edu> Hi, I've tried it with "gt2:pbs" and got a "qsub not found" error. For further investigation, I pulled the env used by Globus and found that there is no "/opt/torque-2.3.6/bin/qsub" under the PATH=. I think that's what causes the "qsub not found" problem. Any suggested solution? Many thanks! -Yi Zhu Swift screen dump: Thu Apr 29 17:25:22 CDT 2010 -bash-3.2$ swift -tc.file tc.test.data -sites.file sshpbscoast.xml first.swift Swift svn swift-r3262 cog-r2729 (cog modified locally) RunID: 20100429-1725-16xmtae7 Progress: Progress: Stage in:1 Progress: Submitted:1 Failed to transfer wrapper log from first-20100429-1725-16xmtae7/info/9 on ec2 Progress: Failed:1 Execution failed: Exception in echo: Arguments: [Hello, world!]
Host: ec2 Directory: first-20100429-1725-16xmtae7/jobs/9/echo-91j9i9rj stderr.txt: stdout.txt: ---- Caused by: Task failed: Error submitting block task org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Cannot submit job: java.io.IOException: qsub: not found at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:66) Caused by: java.io.IOException: java.io.IOException: qsub: not found at java.lang.UNIXProcess.<init>(UNIXProcess.java:148) at java.lang.ProcessImpl.start(ProcessImpl.java:65) at java.lang.ProcessBuilder.start(ProcessBuilder.java:451) at java.lang.Runtime.exec(Runtime.java:591) at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:89) at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) ... 3 more Cleaning up... Shutting down service at https://10.251.214.179:59447 Got channel MetaChannel: 1535747955 -> GSSSChannel-02065467484(1) Command(3, SHUTDOWNSERVICE): handling reply timeout; sendReqTime=100429-172549.902, sendTime=100429-172549.903, now=100429-172559.908 - Done Env pulled from remote: -bash-3.2$ globus-job-run ec2-204-236-204-71.compute-1.amazonaws.com /bin/env [...]
PATH=/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/globus/bin:/opt/vdt-1.10.1/globus/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin PERL5LIB=/opt/vdt-1.10.1/vdt/lib:/opt/vdt-1.10.1/perl/lib/5.8.0:/opt/vdt-1.10.1/perl/lib/5.8.0/i686-linux:/opt/vdt-1.10.1/perl/lib/site_perl/5.8.0:/opt/vdt-1.10.1/perl/lib/site_perl/5.8.0/i686-linux:/opt/vdt-1.10.1/vdt/lib:/opt/vdt-1.10.1/perl/lib/5.8.0:/opt/vdt-1.10.1/perl/lib/5.8.0/i686-linux-thread-multi:/opt/vdt-1.10.1/perl/lib/site_perl/5.8.0:/opt/vdt-1.10.1/perl/lib/site_perl/5.8.0/i686-linux-thread-multi::/opt/vdt-1.10.1/perl/lib/5.8.8:/opt/vdt-1.10.1/perl/lib/site_perl:/opt/vdt-1.10.1/perl/lib/5.8.8:/opt/vdt-1.10.1/perl/lib/site_perl X509_USER_PROXY=/home/torqueuser/.globus/job/ec2-204-236-204-71.compute-1.amazonaws.com/24453.1272579976/x509_up -bash-3.2$ compare to env on remote machine: [torqueuser at ip-10-251-214-179 ~]$ env [...] 
PATH=/opt/torque-2.3.6/bin/:/opt/torque-2.3.6/sbin:/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/globus/bin:/opt/vdt-1.10.1/globus/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/opt/pacman-3.26/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/torqueuser/bin [..] [torqueuser at ip-10-251-214-179 ~]$ On 4/29/2010 11:18 AM, Mihael Hategan wrote: > On Thu, 2010-04-29 at 10:57 -0500, Michael Wilde wrote: > >> OK, thanks. Its not clear to me exactly whats happening, but I get the >> high-level idea that it relates to trust relationships that get broken >> because of differences in DN settings and/or interpretations. >> > No. It's something that someone while writing up GSI thought was going > to make things easier. Well, it doesn't and it makes things unsecure. > But once in, it never changed. > > Normally, when you connect to bankofamerica.com, the browser resolves > that name to an IP, contacts that IP, gets a certificate and checks the > DN against the name you typed. > > In GSI, when you connect to bankofamerica.com, the browser resolves that > name to an IP, contacts that IP, gets a certificate, does a > reverse-resolution on that IP and then checks the DN of the cert against > the reverse-resolved name of the IP. That reverse-resolved name may not > be bankofamerica.com. > > This was done to provide easy (for the sysadmin) ways of having multiple > DNS entries be used with the same machine. The problem is that it also > fails for some scenarios (like the one we have). 
Not only that, it is an > abomination in terms of security since impersonating a service can now > be done with DNS hacks instead of the more difficult schemes involving > cracking RSA/DSA. > > >> Yi, can you try gt2:pbs? >> >> Mihael, at some point can you post a note explaining the issues? >> >> I think we need to document or automate/fix the various interactions between coasters and GSI: >> >> - this new issue/restriction with gt2:gt2:pbs >> - the GSI needs and user config procedures for ssh:pbs >> >> Thanks, >> >> Mike >> >> ----- "Mihael Hategan" wrote: >> >> >>> The host cert isn't incorrect. It's GSI with its silly reverse lookup >>> that causes things to fail. >>> >>> gt2:pbs should work (assuming the pbs provider does). >>> >>> On Wed, 2010-04-28 at 23:54 -0500, Michael Wilde wrote: >>> >>>> Mihael, >>>> >>>> Can you post an update on Yi's problem in getting coasters running >>>> >>> over Nimbus/AWS? >>> >>>> Easy to fix or hard? >>>> >>>> Should he try SSH for the coaster launch? (jobmanager=ssh:pbs ???) >>>> >>>> Thanks, >>>> >>>> Mike >>>> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Apr 29 17:45:24 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 29 Apr 2010 17:45:24 -0500 (CDT) Subject: [Swift-devel] Re: Problem with incorrect host cert DN in coaster GSI authentication In-Reply-To: <4BDA0984.4070604@cs.uchicago.edu> Message-ID: <3394441.756201272581124519.JavaMail.root@zimbra> Yi, from where and to where were you running? 
If the "to" is a Nimbus workspace in AWS, I am assuming that with provider-coasters and jobmanager=gt2:pbs, what happens is this: Swift on the submit host sends a gt2 job to the Nimbus head node; that job runs under the ID that your client-side proxy cert is mapped to; and that login on the Nimbus headnode should have qsub in its PATH (you can test this with globus-job-run of something like /bin/sh -c "which qsub"). If there's some issue with things like .profile execution in getting /opt/torque into the remote PATH, perhaps on your workspace headnode you can link qsub into /usr/bin or similar. You'll need to experiment, unless between Mihael and the Nimbus team someone can provide a definitive answer on what options you have for getting the remote qsub into the headnode's PATH for a GT2 job. - Mike ----- "Yi Zhu" wrote: > HI, > > I've tried it with "gt2:pbs", and got a "qsub not found" error, for > further investigation, I pulled the env used by globus,and found that > there is no "/opt/torque-2.3.6/bin/qsub" under the PATH= ,I think > that's why cause "qsub not found" problem. > > Any suggested solution ? > > > Many thanks! > > -Yi Zhu > > > Swift screen dump: > > Thu Apr 29 17:25:22 CDT 2010 > -bash-3.2$ swift -tc.file tc.test.data -sites.file sshpbscoast.xml > first.swift > Swift svn swift-r3262 cog-r2729 (cog modified locally) > > RunID: 20100429-1725-16xmtae7 > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Failed to transfer wrapper log from > first-20100429-1725-16xmtae7/info/9 on ec2 > Progress: Failed:1 > Execution failed: > Exception in echo: > Arguments: [Hello, world!]
> Host: ec2 > Directory: first-20100429-1725-16xmtae7/jobs/9/echo-91j9i9rj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > Task failed: Error submitting block task > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Cannot submit job: java.io.IOException: qsub: not found > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:63) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.submit(AbstractTaskHandler.java:46) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.submit(ExecutionTaskHandler.java:50) > at > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.run(BlockTaskSubmitter.java:66) > Caused by: java.io.IOException: java.io.IOException: qsub: not found > at java.lang.UNIXProcess.(UNIXProcess.java:148) > at java.lang.ProcessImpl.start(ProcessImpl.java:65) > at java.lang.ProcessBuilder.start(ProcessBuilder.java:451) > at java.lang.Runtime.exec(Runtime.java:591) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.start(AbstractExecutor.java:89) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.submit(AbstractJobSubmissionTaskHandler.java:53) > ... 3 more > > Cleaning up... > Shutting down service at https://10.251.214.179:59447 > Got channel MetaChannel: 1535747955 -> GSSSChannel-02065467484(1) > Command(3, SHUTDOWNSERVICE): handling reply timeout; > sendReqTime=100429-172549.902, sendTime=100429-172549.903, > now=100429-172559.908 > - Done > > Env pulled from remote: > > -bash-3.2$ globus-job-run ec2-204-236-204-71.compute-1.amazonaws.com > /bin/env > [...] 
> PATH=/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/globus/bin:/opt/vdt-1.10.1/globus/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin > PERL5LIB=/opt/vdt-1.10.1/vdt/lib:/opt/vdt-1.10.1/perl/lib/5.8.0:/opt/vdt-1.10.1/perl/lib/5.8.0/i686-linux:/opt/vdt-1.10.1/perl/lib/site_perl/5.8.0:/opt/vdt-1.10.1/perl/lib/site_perl/5.8.0/i686-linux:/opt/vdt-1.10.1/vdt/lib:/opt/vdt-1.10.1/perl/lib/5.8.0:/opt/vdt-1.10.1/perl/lib/5.8.0/i686-linux-thread-multi:/opt/vdt-1.10.1/perl/lib/site_perl/5.8.0:/opt/vdt-1.10.1/perl/lib/site_perl/5.8.0/i686-linux-thread-multi::/opt/vdt-1.10.1/perl/lib/5.8.8:/opt/vdt-1.10.1/perl/lib/site_perl:/opt/vdt-1.10.1/perl/lib/5.8.8:/opt/vdt-1.10.1/perl/lib/site_perl > X509_USER_PROXY=/home/torqueuser/.globus/job/ec2-204-236-204-71.compute-1.amazonaws.com/24453.1272579976/x509_up > -bash-3.2$ > > compare to env on remote machine: > > [torqueuser at ip-10-251-214-179 ~]$ env > [...] 
> PATH=/opt/torque-2.3.6/bin/:/opt/torque-2.3.6/sbin:/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/globus/bin:/opt/vdt-1.10.1/globus/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/opt/pacman-3.26/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/torqueuser/bin > [..] > [torqueuser at ip-10-251-214-179 ~]$ > > > On 4/29/2010 11:18 AM, Mihael Hategan wrote: > > On Thu, 2010-04-29 at 10:57 -0500, Michael Wilde wrote: > > OK, thanks. Its not clear to me exactly whats happening, but I get the > high-level idea that it relates to trust relationships that get broken > because of differences in DN settings and/or interpretations. No. It's > something that someone while writing up GSI thought was going > to make things easier. Well, it doesn't and it makes things unsecure. > But once in, it never changed. > > Normally, when you connect to bankofamerica.com, the browser resolves > that name to an IP, contacts that IP, gets a certificate and checks > the > DN against the name you typed. > > In GSI, when you connect to bankofamerica.com, the browser resolves > that > name to an IP, contacts that IP, gets a certificate, does a > reverse-resolution on that IP and then checks the DN of the cert > against > the reverse-resolved name of the IP. That reverse-resolved name may > not > be bankofamerica.com. > > This was done to provide easy (for the sysadmin) ways of having > multiple > DNS entries be used with the same machine. The problem is that it also > fails for some scenarios (like the one we have). 
Not only that, it is > an > abomination in terms of security since impersonating a service can now > be done with DNS hacks instead of the more difficult schemes involving > cracking RSA/DSA. > > Yi, can you try gt2:pbs? > > Mihael, at some point can you post a note explaining the issues? > > I think we need to document or automate/fix the various interactions > between coasters and GSI: > > - this new issue/restriction with gt2:gt2:pbs > - the GSI needs and user config procedures for ssh:pbs > > Thanks, > > Mike > > ----- "Mihael Hategan" wrote: > > The host cert isn't incorrect. It's GSI with its silly reverse lookup > that causes things to fail. > > gt2:pbs should work (assuming the pbs provider does). > > On Wed, 2010-04-28 at 23:54 -0500, Michael Wilde wrote: > > Mihael, > > Can you post an update on Yi's problem in getting coasters running > over Nimbus/AWS? > > Easy to fix or hard? > > Should he try SSH for the coaster launch? (jobmanager=ssh:pbs ???) > > Thanks, > > Mike -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Apr 29 19:48:30 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 29 Apr 2010 19:48:30 -0500 Subject: [Swift-devel] deep field bug In-Reply-To: <21102453.724491272497215588.JavaMail.root@zimbra> References: <21102453.724491272497215588.JavaMail.root@zimbra> Message-ID: <1272588510.9099.1.camel@localhost> On Wed, 2010-04-28 at 18:26 -0500, Michael Wilde wrote: > To close this issue down: Mihael provided the following second patch on top of the diff below: > > http://www.mcs.anl.gov/~hategan/deepfieldbug2.patch > > This has been working for me, and appears to completely solve this problem. > > Mihael, can you commit these two patches? Apparently I already have. 
Two weeks ago: https://trac.ci.uchicago.edu/swift/changeset/3283 I also committed a fix for the compilation problem: https://trac.ci.uchicago.edu/swift/changeset/3289 Let me know if you still see weirdness there. From hategan at mcs.anl.gov Thu Apr 29 18:18:04 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 29 Apr 2010 18:18:04 -0500 Subject: [Swift-devel] Re: Problem with incorrect host cert DN in coaster GSI authentication In-Reply-To: <4BDA0984.4070604@cs.uchicago.edu> References: <14586782.739691272556621922.JavaMail.root@zimbra> <1272557888.17680.15.camel@localhost> <4BDA0984.4070604@cs.uchicago.edu> Message-ID: <1272583084.7581.9.camel@localhost> On Thu, 2010-04-29 at 17:34 -0500, Yi Zhu wrote: > HI, > > I've tried it with "gt2:pbs", and got a "qsub not found" error, for > further investigation, I pulled the env used by globus,and found > that there is no "/opt/torque-2.3.6/bin/qsub" under the PATH= ,I think > that's why cause "qsub not found" problem. > > Any suggested solution ? Two actually. 1. This is for the qsub problem: you can add the relevant environment variables (for Torque) in sites.xml. 2. This is for the DN issue with gt2:gt2:pbs: Edit /etc/hosts and make sure that the expected DN is the first entry for the internal IP passed to the coaster service. If the entry is not in there at all, add it. This is a way to impersonate a Globus service and possibly do a man-in-the-middle thing, but it may also work to fix the DN mismatch problem. Mihael From wilde at mcs.anl.gov Thu Apr 29 18:16:33 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 29 Apr 2010 18:16:33 -0500 (CDT) Subject: [Swift-devel] How long do idle coaster workers linger? In-Reply-To: <1272582798.7581.4.camel@localhost> Message-ID: <8412862.756471272582993434.JavaMail.root@zimbra> Excellent, thanks. That explains what we saw in debugging Wenjun's most recent problem posted here. 
Long term it might be necessary to separate out these two quantities, but for now this is certainly fine. - Mike ----- "Mihael Hategan" wrote: > On Thu, 2010-04-29 at 16:37 -0500, Michael Wilde wrote: > > How long do coaster workers (or blocks?) idle before they are shut > down? > > > > Idle blocks are shut down after the reserve time (one minute). If by > some reason the connection to the workers is lost, workers also shut > down after some idle time, which is 2 or 4 minutes. -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From hategan at mcs.anl.gov Thu Apr 29 18:13:18 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 29 Apr 2010 18:13:18 -0500 Subject: [Swift-devel] How long do idle coaster workers linger? In-Reply-To: <20453294.754031272577078702.JavaMail.root@zimbra> References: <20453294.754031272577078702.JavaMail.root@zimbra> Message-ID: <1272582798.7581.4.camel@localhost> On Thu, 2010-04-29 at 16:37 -0500, Michael Wilde wrote: > How long do coaster workers (or blocks?) idle before they are shut down? > Idle blocks are shut down after the reserve time (one minute). If by some reason the connection to the workers is lost, workers also shut down after some idle time, which is 2 or 4 minutes. From hategan at mcs.anl.gov Thu Apr 29 18:10:11 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 29 Apr 2010 18:10:11 -0500 Subject: [Swift-devel] Re: Fwd: Swift run failed in portal In-Reply-To: <15373846.753931272576968860.JavaMail.root@zimbra> References: <15373846.753931272576968860.JavaMail.root@zimbra> Message-ID: <1272582611.7581.1.camel@localhost> Right. Sorry. I will try to commit the patches later today and fix Mr. Voeckler's interference into the swift compilation process. On Thu, 2010-04-29 at 16:36 -0500, Michael Wilde wrote: > Mihael, just a note that Wenjun could use these patches committed. 
> Note that I did *not* have time to try the language tests, which ideally should be done before committing. > > - Mike > > ----- Forwarded Message ----- > From: "wenjun wu" > To: "Michael Wilde" > Sent: Thursday, April 29, 2010 4:26:53 PM GMT -06:00 US/Canada Central > Subject: Re: Swift run failed in portal > > Hi Mike, > You version of swift works fine with RaptorLoops. But I got some > compilation issue after doing the two patches. > > Wenjun > > [javac] Compiling 345 source files to > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/build > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/Catalog.java:24: > warning: unmappable character for encoding UTF8 > [javac] * @author Jens-S. V?ckler > [javac] ^ > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/CatalogEntry.java:21: > warning: unmappable character for encoding UTF8 > [javac] * @author Jens-S. V?ckler > [javac] ^ > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/util/Escape.java:36: > warning: unmappable character for encoding UTF8 > [javac] * @author Jens-S. V?ckler > [javac] ^ > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/util/ProfileParserException.java:23: > warning: unmappable character for encoding UTF8 > [javac] * @author Jens-S. V?ckler > [javac] ^ > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/globus/swift/catalog/util/Separator.java:25: > warning: unmappable character for encoding UTF8 > [javac] * @author Jens-S. 
V?ckler > [javac] ^ > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/griphyn/vdl/mapping/AbstractDataNode.java:543: > closeDeepStructure() is already defined in > org.griphyn.vdl.mapping.AbstractDataNode > [javac] public void closeDeepStructure() { > [javac] ^ > [javac] > /autonfs/gpfs-pads/projects/CI-MCB000009/scienceportal/swift-branch/cog/modules/swift/src/org/griphyn/vdl/karajan/lib/SetFieldValue.java:51: > cannot find symbol > [javac] symbol : method getLast() > [javac] location: class org.griphyn.vdl.mapping.Path > [javac] markAsAvailable(stack, > leaf.getParent(), leaf.getPathFromRoot().getLast()); > > [javac] > ^ > [javac] Note: Some input files use or override a deprecated API. > [javac] Note: Recompile with -Xlint:deprecation for details. > [javac] 2 errors > [javac] 5 warnings > > > Wenjun, did the patched Swift release solve this problem? > > > > Thanks, > > > > Mike > > > > ----- "wenjun wu" wrote: > > > > > >> It is located at > >> /gpfs/pads/oops/scienceportal/apache-tomcat/webapps/SIDGridPortal/workflow/20100421/oops-20100421-1638-5up59v27 > >> > >> Wenjun > >> > >>> Wenjun, Tom, > >>> > >>> The run I started during our meeting failed: > >>> > >>> Apr 21 16:38 - 5up59v27 Failed > >>> > >>> How do I tell why? > >>> > >>> Is there at least a server-side run directory that the user could > >>> > >> navigate to within the CI filespace? 
> >> > >>> Thanks, > >>> > >>> Mike > >>> > >>> > > > > From yizhu at cs.uchicago.edu Fri Apr 30 01:33:40 2010 From: yizhu at cs.uchicago.edu (Yi Zhu) Date: Fri, 30 Apr 2010 01:33:40 -0500 Subject: [Swift-devel] Re: Problem with incorrect host cert DN in coaster GSI authentication In-Reply-To: <1272583084.7581.9.camel@localhost> References: <14586782.739691272556621922.JavaMail.root@zimbra> <1272557888.17680.15.camel@localhost> <4BDA0984.4070604@cs.uchicago.edu> <1272583084.7581.9.camel@localhost> Message-ID: <4BDA79C4.2040301@cs.uchicago.edu> On 4/29/2010 6:18 PM, Mihael Hategan wrote: > On Thu, 2010-04-29 at 17:34 -0500, Yi Zhu wrote: > >> HI, >> >> I've tried it with "gt2:pbs", and got a "qsub not found" error, for >> further investigation, I pulled the env used by globus,and found >> that there is no "/opt/torque-2.3.6/bin/qsub" under the PATH= ,I think >> that's why cause "qsub not found" problem. >> >> Any suggested solution ? >> > Two actually. > 1. This is for the qsub problem: you can add the relevant environment > variables (for Torque) in sites.xml. > I've tried adding /opt/torque-2.3.6/bin to sites.xml, but I still get the same error: "qsub is not found". Making a link from /opt/torque-2.3.6/bin/qsub into /usr/bin seems to work, but then I get another error: -bash-3.2$ -bash-3.2$ swift -tc.file tc.test.data -sites.file sshpbscoast.xml first.swift Swift svn swift-r3262 cog-r2729 (cog modified locally) RunID: 20100430-0105-nzzk6xxd Progress: Progress: Stage in:1 Progress: Submitted:1 Progress: Active:1 Failed to transfer wrapper log from first-20100430-0105-nzzk6xxd/info/x on ec2 Progress: Failed:1 Execution failed: Exception in echo: Arguments: [Hello, world!] Host: ec2 Directory: first-20100430-0105-nzzk6xxd/jobs/x/echo-xvom1arj stderr.txt: stdout.txt: ---- Caused by: No status file was found. Check the shared filesystem on ec2 Cleaning up...
Shutting down service at https://10.251.214.179:48615 Got channel MetaChannel: 1317572826 -> GSSSChannel-11921994068(1) + Done and the coaster-bootstrap log: [torqueuser at ip-10-251-214-179 ~]$ [torqueuser at ip-10-251-214-179 ~]$ cat coaster-bootstrap-11921994068.log using plain mode BS: http://tp-login2.ci.uchicago.edu:57278 which: no gmd5sum in (/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/globus/bin:/opt/vdt-1.10.1/globus/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin) Expected checksum: 9017a89a3a700d9866592187fdb27b5b Computed checksum: 9017a89a3a700d9866592187fdb27b5b JAVA=/opt/vdt-1.10.1/jdk1.5/bin/java plain /opt/vdt-1.10.1/jdk1.5/bin/java -Djava=/opt/vdt-1.10.1/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= -DX509_USER_PROXY=/home/torqueuser/.globus/job/ec2-204-236-204-71.compute-1.amazonaws.com/31355.1272607512/x509_up -DX509_CERT_DIR=/etc/grid-security/certificates -DGLOBUS_HOSTNAME=ec2-204-236-204-71.compute-1.amazonaws.com -jar /tmp/bootstrap.t31454 http://tp-login2.ci.uchicago.edu:57278 https://128.135.125.117:54201 11921994068 Canceling job 28.ip-10-251-214-179.ec2.internal EC: 0 [torqueuser at ip-10-251-214-179 ~]$ [torqueuser at ip-10-251-214-179 ~]$ > 2. 
This is for the DN issue with gt2:gt2:pbs: Edit /etc/hosts and make > sure that the expected DN is the first entry for the internal IP passed > to the coaster service. If the entry is not in there at all, add it. > This is a way to impersonate a Globus service and possibly do a > man-in-the-middle thing, but it may also work to fix the DN mismatch > problem. > > Mihael > > > Modifying the /etc/hosts entry to match the expected DN solved the DN mismatch problem, but I still get a "No status file was found. Check the shared filesystem on ec2" error, the same as the one mentioned above. -Yi Zhu From wilde at mcs.anl.gov Fri Apr 30 08:57:07 2010 From: wilde at mcs.anl.gov (wilde at mcs.anl.gov) Date: Fri, 30 Apr 2010 08:57:07 -0500 (CDT) Subject: [Swift-devel] Re: Problem with incorrect host cert DN in coaster GSI authentication In-Reply-To: <9075314.765831272635357739.JavaMail.root@zimbra> Message-ID: <6803038.766261272635827815.JavaMail.root@zimbra> Yi, I'll leave question 2 for Mihael. For your first problem (using a link to qsub), the logs below suggest that your coaster workers did start (but you should verify this). Look in your ~/.globus/scripts directory to see if an error was returned by qsub. (I suspect not, since this would likely have generated a message on stdout/err and in your Swift run log.) Look in your work directory (from sites.xml) to see if the results of echo were generated. Look in your coaster logs to see if there were any problems in launching coasters (but from below it looks like not). Try changing echo to sleep 12345 to see if the app is starting or not (then use qstat to find the node, and ssh to the node to see if the "sleep" is running). If it got that far, perhaps a configuration error is preventing the result from getting back successfully. I can meet with you after 5PM if you get stuck in debugging.
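Mihael's suggestion 2, and Yi's report that editing /etc/hosts cleared the DN mismatch, follow from the GSI behaviour Mihael described earlier in the thread: the host certificate is compared with the reverse-resolved name of the server's IP, not with the name the client typed. The Java sketch below is a hypothetical illustration of that difference, not Globus or CoG code. It assumes reverse resolution is answered from /etc/hosts and that the resolver returns the first name on the first line for an IP. The class and method names, and the sample hosts lines (loosely based on addresses quoted in the thread), are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of two host-verification styles; not Globus/CoG code. */
public class HostCheckSketch {

    /** Parses /etc/hosts-style text: "IP canonical-name [aliases...]". */
    static Map<String, List<String>> parseHosts(String hostsText) {
        Map<String, List<String>> table = new LinkedHashMap<>();
        for (String rawLine : hostsText.split("\n")) {
            String line = rawLine.replaceAll("#.*", "").trim(); // drop comments
            if (line.isEmpty()) continue;
            String[] fields = line.split("\\s+");
            if (fields.length < 2) continue;
            // Assumption: only the first line for a given IP is used.
            if (!table.containsKey(fields[0])) {
                List<String> names = new ArrayList<>();
                for (int i = 1; i < fields.length; i++) names.add(fields[i]);
                table.put(fields[0], names);
            }
        }
        return table;
    }

    /** Assumed reverse lookup: the first name listed for the IP, or null. */
    static String reverseLookup(Map<String, List<String>> table, String ip) {
        List<String> names = table.get(ip);
        return names == null ? null : names.get(0);
    }

    /** Conventional check: certificate host name vs. the name the user typed. */
    static boolean conventionalCheck(String typedName, String certHostName) {
        return typedName.equalsIgnoreCase(certHostName);
    }

    /** GSI-style check as described in the thread: vs. the reverse-resolved name. */
    static boolean gsiStyleCheck(Map<String, List<String>> table, String ip, String certHostName) {
        String reverse = reverseLookup(table, ip);
        return reverse != null && reverse.equalsIgnoreCase(certHostName);
    }

    public static void main(String[] args) {
        String certName = "ec2-204-236-204-71.compute-1.amazonaws.com"; // assumed cert host name
        String ip = "10.251.214.179";

        // Internal name listed first vs. certificate name listed first.
        String before = ip + " ip-10-251-214-179.ec2.internal " + certName + "\n";
        String after = ip + " " + certName + " ip-10-251-214-179.ec2.internal\n";

        System.out.println(conventionalCheck(certName, certName));
        System.out.println(gsiStyleCheck(parseHosts(before), ip, certName));
        System.out.println(gsiStyleCheck(parseHosts(after), ip, certName));
    }
}
```

Running `main` prints `true`, `false`, `true`: the conventional check passes, the GSI-style check fails while the internal name is listed first, and it passes once the certificate's name is moved to the front of the line. This is also why Mihael calls the scheme weak: whoever controls name resolution controls what the certificate is compared against.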
- Mike ----- "Yi Zhu" wrote: > On 4/29/2010 6:18 PM, Mihael Hategan wrote: > > On Thu, 2010-04-29 at 17:34 -0500, Yi Zhu wrote: > > HI, > > I've tried it with "gt2:pbs", and got a "qsub not found" error, for > further investigation, I pulled the env used by globus,and found > that there is no "/opt/torque-2.3.6/bin/qsub" under the PATH= ,I think > that's why cause "qsub not found" problem. > > Any suggested solution ? Two actually. > 1. This is for the qsub problem: you can add the relevant environment > variables (for Torque) in sites.xml. I've tried to add > /opt/torque-2.3.6/bin > > to the sites.xml, but still get the same error;" qsub is not found". > > make a link from /opt/torque-2.3.6/bin/qsub to /usr/bin seems works, > but I get another error: > > -bash-3.2$ > -bash-3.2$ swift -tc.file tc.test.data -sites.file sshpbscoast.xml > first.swift > Swift svn swift-r3262 cog-r2729 (cog modified locally) > > RunID: 20100430-0105-nzzk6xxd > Progress: > Progress: Stage in:1 > Progress: Submitted:1 > Progress: Active:1 > Failed to transfer wrapper log from > first-20100430-0105-nzzk6xxd/info/x on ec2 > Progress: Failed:1 > Execution failed: > Exception in echo: > Arguments: [Hello, world!] > Host: ec2 > Directory: first-20100430-0105-nzzk6xxd/jobs/x/echo-xvom1arj > stderr.txt: > > stdout.txt: > > ---- > > Caused by: > No status file was found. Check the shared filesystem on ec2 > Cleaning up... 
> Shutting down service at https://10.251.214.179:48615 > Got channel MetaChannel: 1317572826 -> GSSSChannel-11921994068(1) > + Done > > and the coaster-bootstrap log: > > [torqueuser at ip-10-251-214-179 ~]$ > [torqueuser at ip-10-251-214-179 ~]$ cat > coaster-bootstrap-11921994068.log > using plain mode > BS: http://tp-login2.ci.uchicago.edu:57278 > which: no gmd5sum in > (/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/globus/bin:/opt/vdt-1.10.1/globus/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin) > Expected checksum: 9017a89a3a700d9866592187fdb27b5b > Computed checksum: 9017a89a3a700d9866592187fdb27b5b > JAVA=/opt/vdt-1.10.1/jdk1.5/bin/java > plain /opt/vdt-1.10.1/jdk1.5/bin/java > -Djava=/opt/vdt-1.10.1/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= > -DX509_USER_PROXY=/home/torqueuser/.globus/job/ec2-204-236-204-71.compute-1.amazonaws.com/31355.1272607512/x509_up > -DX509_CERT_DIR=/etc/grid-security/certificates > -DGLOBUS_HOSTNAME=ec2-204-236-204-71.compute-1.amazonaws.com -jar > /tmp/bootstrap.t31454 http://tp-login2.ci.uchicago.edu:57278 > https://128.135.125.117:54201 11921994068 > Canceling job 28.ip-10-251-214-179.ec2.internal > > EC: 0 > [torqueuser at ip-10-251-214-179 ~]$ > [torqueuser at ip-10-251-214-179 ~]$ > 
> > > > 2. This is for the DN issue with gt2:gt2:pbs: Edit /etc/hosts and make > sure that the expected DN is the first entry for the internal IP > passed > to the coaster service. If the entry is not in there at all, add it. > This is a way to impersonate a Globus service and possibly do a > man-in-the-middle thing, but it may also work to fix the DN mismatch > problem. > > Mihael by modify the entry in /etc/hosts to the expect DN address, so > solve the DNS mismatch problem, but still get an " No status file was > found. Check the shared filesystem on ec2" error As same as the one > mentioned above. > > -Yi Zhu -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From yizhu at cs.uchicago.edu Fri Apr 30 10:09:04 2010 From: yizhu at cs.uchicago.edu (Yi Zhu) Date: Fri, 30 Apr 2010 10:09:04 -0500 Subject: [Swift-devel] Re: Problem with incorrect host cert DN in coaster GSI authentication In-Reply-To: <6803038.766261272635827815.JavaMail.root@zimbra> References: <6803038.766261272635827815.JavaMail.root@zimbra> Message-ID: <4BDAF290.2000401@cs.uchicago.edu> Hi Mike, I think I've found the problem. In coasters.log there is an entry showing that qstat or qdel cannot be found, so I hard-linked all the binaries in /opt/torque-2.3.6/bin into /usr/bin, and then I got a successful response. This is not a perfect solution; it would be better to pass the environment settings through profile files rather than apply this hack on the remote site, but in any case I can get coasters to work now. Many thanks! -Yi Zhu On 4/30/2010 8:57 AM, wilde at mcs.anl.gov wrote: > Yi, > > I'll leave question 2 for Mihael. > > For your first problem (using a link to qsub) the logs below suggest that your coaster workers did start (but you should verify this). > > Look in your ~/.globus/scripts directory to see if an error was returned by qsub.
(I suspect not, since this would likely have generated a message on stdout/err and in your swift run log. > > Look in your workdirectory (from sites.xml) to see if the results of echo were generated. > > Look in your coaster logs to see if there were any problems in launching coasters (bot from below it looks like not. > > Try changing echo to sleep 12345 to see if the app is starting or not. (then use qstat to find the node, and ssh to the node to see if the "sleep" is running). > > If it got that far, perhaps a configuration error is preventing the result from getting back successfully. > > I can meet with you after 5PM if you get stuck in debugging. > > - Mike > > ----- "Yi Zhu" wrote: > > >> On 4/29/2010 6:18 PM, Mihael Hategan wrote: >> >> On Thu, 2010-04-29 at 17:34 -0500, Yi Zhu wrote: >> >> HI, >> >> I've tried it with "gt2:pbs", and got a "qsub not found" error, for >> further investigation, I pulled the env used by globus,and found >> that there is no "/opt/torque-2.3.6/bin/qsub" under the PATH= ,I think >> that's why cause "qsub not found" problem. >> >> Any suggested solution ? Two actually. >> 1. This is for the qsub problem: you can add the relevant environment >> variables (for Torque) in sites.xml. I've tried to add >> /opt/torque-2.3.6/bin >> >> to the sites.xml, but still get the same error;" qsub is not found". >> >> make a link from /opt/torque-2.3.6/bin/qsub to /usr/bin seems works, >> but I get another error: >> >> -bash-3.2$ >> -bash-3.2$ swift -tc.file tc.test.data -sites.file sshpbscoast.xml >> first.swift >> Swift svn swift-r3262 cog-r2729 (cog modified locally) >> >> RunID: 20100430-0105-nzzk6xxd >> Progress: >> Progress: Stage in:1 >> Progress: Submitted:1 >> Progress: Active:1 >> Failed to transfer wrapper log from >> first-20100430-0105-nzzk6xxd/info/x on ec2 >> Progress: Failed:1 >> Execution failed: >> Exception in echo: >> Arguments: [Hello, world!] 
>> Host: ec2 >> Directory: first-20100430-0105-nzzk6xxd/jobs/x/echo-xvom1arj >> stderr.txt: >> >> stdout.txt: >> >> ---- >> >> Caused by: >> No status file was found. Check the shared filesystem on ec2 >> Cleaning up... >> Shutting down service at https://10.251.214.179:48615 >> Got channel MetaChannel: 1317572826 -> GSSSChannel-11921994068(1) >> + Done >> >> and the coaster-bootstrap log: >> >> [torqueuser at ip-10-251-214-179 ~]$ >> [torqueuser at ip-10-251-214-179 ~]$ cat >> coaster-bootstrap-11921994068.log >> using plain mode >> BS: http://tp-login2.ci.uchicago.edu:57278 >> which: no gmd5sum in >> (/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/globus/bin:/opt/vdt-1.10.1/globus/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/opt/vdt-1.10.1/gums/scripts:/opt/vdt-1.10.1/prima/bin:/opt/vdt-1.10.1/cert-scripts/bin:/opt/vdt-1.10.1/glite/sbin:/opt/vdt-1.10.1/glite/bin:/opt/vdt-1.10.1/jdk1.5/bin:/opt/vdt-1.10.1/edg/sbin:/opt/vdt-1.10.1/gip/bin:/opt/vdt-1.10.1/gpt/sbin:/opt/vdt-1.10.1/wget/bin:/opt/vdt-1.10.1/logrotate/sbin:/opt/vdt-1.10.1/perl/bin:/opt/pacman-3.26/bin:/opt/vdt-1.10.1/vdt/sbin:/opt/vdt-1.10.1/vdt/bin:/sbin:/usr/sbin:/bin:/usr/bin:/usr/X11R6/bin) >> Expected checksum: 9017a89a3a700d9866592187fdb27b5b >> Computed checksum: 9017a89a3a700d9866592187fdb27b5b >> JAVA=/opt/vdt-1.10.1/jdk1.5/bin/java >> plain /opt/vdt-1.10.1/jdk1.5/bin/java >> -Djava=/opt/vdt-1.10.1/jdk1.5/bin/java -DGLOBUS_TCP_PORT_RANGE= >> -DX509_USER_PROXY=/home/torqueuser/.globus/job/ec2-204-236-204-71.compute-1.amazonaws.com/31355.1272607512/x509_up >> -DX509_CERT_DIR=/etc/grid-security/certificates >> -DGLOBUS_HOSTNAME=ec2-204-236-204-71.compute-1.amazonaws.com -jar >> 
/tmp/bootstrap.t31454 http://tp-login2.ci.uchicago.edu:57278 >> https://128.135.125.117:54201 11921994068 >> Canceling job 28.ip-10-251-214-179.ec2.internal >> >> EC: 0 >> [torqueuser at ip-10-251-214-179 ~]$ >> [torqueuser at ip-10-251-214-179 ~]$ >> >> >> >> >> 2. This is for the DN issue with gt2:gt2:pbs: Edit /etc/hosts and make >> sure that the expected DN is the first entry for the internal IP >> passed >> to the coaster service. If the entry is not in there at all, add it. >> This is a way to impersonate a Globus service and possibly do a >> man-in-the-middle thing, but it may also work to fix the DN mismatch >> problem. >> >> Mihael by modify the entry in /etc/hosts to the expect DN address, so >> solve the DNS mismatch problem, but still get an " No status file was >> found. Check the shared filesystem on ec2" error As same as the one >> mentioned above. >> >> -Yi Zhu >> >
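Yi's diagnosis (qsub, and later qstat or qdel, missing from the PATH seen by the GT2 job) and the /usr/bin link workaround both come down to ordinary PATH search. The Java sketch below is a hypothetical stand-in for `which`, not part of Swift or CoG. It walks the colon-separated PATH entries in order and returns the first directory that contains the command. To stay self-contained, it takes the set of existing files as an argument instead of touching the disk. The abbreviated PATH values and the file set are assumptions modelled on the environment dumps quoted in this thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical PATH lookup, mimicking `which`; not Swift/CoG code. */
public class PathLookupSketch {

    /**
     * Returns the full path of the first PATH entry containing `command`,
     * or null if none does (the "qsub: not found" case).
     * `existingFiles` stands in for the filesystem so the example runs anywhere.
     */
    static String which(String pathVar, String command, Set<String> existingFiles) {
        for (String dir : pathVar.split(":")) {
            // This sketch skips empty entries (a real shell treats them as the current directory).
            if (dir.isEmpty()) continue;
            String base = dir.endsWith("/") ? dir.substring(0, dir.length() - 1) : dir;
            String candidate = base + "/" + command;
            if (existingFiles.contains(candidate)) return candidate;
        }
        return null;
    }

    public static void main(String[] args) {
        // Abbreviated versions of the two PATH values quoted in the thread.
        String gt2JobPath = "/opt/vdt-1.10.1/globus/bin:/sbin:/usr/sbin:/bin:/usr/bin";
        String loginPath = "/opt/torque-2.3.6/bin/:/opt/torque-2.3.6/sbin:/usr/local/bin:/bin:/usr/bin";

        Set<String> files = new HashSet<>(Arrays.asList(
                "/opt/torque-2.3.6/bin/qsub",
                "/opt/torque-2.3.6/bin/qstat",
                "/opt/torque-2.3.6/bin/qdel"));

        System.out.println(which(gt2JobPath, "qsub", files)); // null: not found
        System.out.println(which(loginPath, "qsub", files));  // /opt/torque-2.3.6/bin/qsub

        // The workaround from the thread: make the Torque commands visible under /usr/bin too.
        for (String cmd : new String[] {"qsub", "qstat", "qdel"}) {
            files.add("/usr/bin/" + cmd);
        }
        System.out.println(which(gt2JobPath, "qstat", files)); // /usr/bin/qstat
    }
}
```

The cleaner fix Yi mentions, getting /opt/torque-2.3.6/bin into the job's environment instead of linking binaries on the remote host, corresponds to changing `pathVar` rather than the file set. How to do that reliably for a GT2 job (sites.xml environment settings or profile files) is still the open question in the thread.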