From iraicu at cs.uchicago.edu Tue Nov 2 14:47:36 2010
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Tue, 02 Nov 2010 14:47:36 -0500
Subject: [Swift-user] Call for Participation: 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10), co-located with Supercomputing 2010 -- win an Apple iPad!!!
Message-ID: <4CD06AD8.1050201@cs.uchicago.edu>

Dear all,

We invite you to participate in the 3rd Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10) on Monday, November 15th, 2010, co-located with IEEE/ACM Supercomputing 2010 in New Orleans, LA. MTAGS will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of large-scale many-task computing (MTC) applications on large-scale clusters, grids, supercomputers, and cloud computing infrastructure.

A few highlights of the workshop:

* Workshop program: the program can be found at http://www.cs.iit.edu/~iraicu/MTAGS10/program.htm; papers and slides will be posted by November 15th, 2010
* Keynote speaker: Roger Barga, PhD, Architect, Extreme Computing Group, Microsoft Research
* Best Paper nominees:
  o Timothy Armstrong, Mike Wilde, Daniel Katz, Zhao Zhang, Ian Foster. "Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks", 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10) 2010
  o Thomas Budnik, Brant Knudson, Mark Megerian, Sam Miller, Mike Mundy, Will Stockdell. "Blue Gene/Q Resource Management Architecture", 3rd IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS10) 2010
* Attendance prize: there will be a free Apple iPad giveaway at the end of the workshop; you must attend at least one talk during the day and be present at 6:15PM, at the end of the workshop, to win

The workshop program is:

* 9:00AM Opening Remarks
* 9:10AM Keynote: Data Laden Clouds, Roger Barga, PhD, Architect, Extreme Computing Group, Microsoft Research
* Session 1: Applications
  o 10:30AM Many Task Computing for Modeling the Fate of Oil Discharged from the Deep Water Horizon Well Blowout
  o 11:00AM Many-Task Applications in the Integrated Plasma Simulator
  o 11:30AM Compute and data management strategies for grid deployment of high throughput protein structure studies
* Session 2: Storage
  o 1:30PM Processing Massive Sized Graphs Using Sector/Sphere
  o 2:00PM Easy and Instantaneous Processing for Data-Intensive Workflows
  o 2:30PM Detecting Bottlenecks in Parallel DAG-based Data Flow Programs
* Session 3: Resource Management
  o 3:30PM Improving Many-Task Computing in Scientific Workflows Using P2P Techniques
  o 4:00PM Dynamic Task Scheduling for the Uintah Framework
  o 4:30PM Automatic and Coordinated Job Recovery for High Performance Computing
* Session 4: Best Paper Nominees
  o 5:15PM Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks
  o 5:45PM Blue Gene/Q Resource Management Architecture
* 6:15PM Best Paper Award, Attendance Prizes, & Closing Remarks

We look forward to seeing you at the workshop in less than 2 weeks!

Regards,
Ioan Raicu, Yong Zhao, and Ian Foster
MTAGS10 Chairs
http://www.cs.iit.edu/~iraicu/MTAGS10/

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor
=================================================================
Computer Science Department
Illinois Institute of Technology
10 W.
31st Street, Stuart Building, Room 237D
Chicago, IL 60616
=================================================================
Cell: 1-847-722-0876
Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
=================================================================

From wilde at mcs.anl.gov Wed Nov 3 15:37:07 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 3 Nov 2010 15:37:07 -0500 (CDT)
Subject: [Swift-user] Re: 1.0 vs 1
In-Reply-To: <0663F080-2B06-45C2-A822-4936D3895BF3@iro.umontreal.ca>
Message-ID: <329307237.9932.1288816627030.JavaMail.root@zimbra.anl.gov>

Marc, the solution here is to say:

    string str_modulo = @strcat( int_index, ":1000" );

instead of:

    string str_modulo = @strcat( @tostring( int_index ), ":1000" );

It's a handy feature of @strcat() that it coerces its arguments to strings when they are numeric, and does so in the desired way, unlike @tostring(). I am not sure whether @tostring() has always behaved this way (i.e., always formatting as if its argument were a float), or whether that was a recent, perhaps undesirable or inadvertent, change. This is the kind of question you should submit to swift-user for general discussion, so that other developers can offer advice.

Also, we have a sprintf()-like function that I think is not yet documented, if you need it. I need to find the details on that.

- Mike

----- Original Message -----
> Hi guys,
>
> I'm trying to generate a string whose content should be "0:1000",
> "1:1000", etc...
>
>     foreach mod_index in [0:1]
>     {
>         int int_index = mod_index;
>         string str_modulo = @strcat( @tostring( int_index ), ":1000" );
>         ...
>     }
>
> That str_modulo string ends up containing "0.0:1000", "1.0:1000", etc.
>
> I thought it was an int? So the culprit is the tostring function then?
> It is not defined for the int type, so the value is silently converted to
> float and then passed to the tostring function??
>
> Very Best,
> Marc.

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From matthew.woitaszek at gmail.com Wed Nov 3 18:28:50 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Wed, 3 Nov 2010 17:28:50 -0600
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
Message-ID:

Good afternoon,

Is there a way to update PBS resource requests when using coasters to supply modified PBS resource strings such as "nodes=1:ppn=8"? (Or other arbitrary resource requests, such as node properties?)

Of course, I'm just trying to get coasters to allocate all of the processors on an 8-core node, using either the "gt2:gt2:pbs" or "local:pbs" provider. Both submit jobs just fine. I found no discernible difference with the "host_types" Globus namespace variable, presuming I'm setting it right.

The particular cluster I'm using allows node packing for users that run lots of single-processor tasks, so without ppn, it will assume nodes=1,ncpus=1 and thus pack 8 jobs on each node before moving on to the next node. (I know it won't be an issue at sites that make nodes exclusive. On this system, the queue default is "nodes=1:ppn=8", but because coasters explicitly specifies the number of nodes in its generated resource request, the ppn default seems to get lost!)
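To make the difference concrete, the two resource requests look like this in standard Torque qsub syntax (the script name here is just a placeholder):

    # what gets requested today; Torque treats it as nodes=1:ppn=1,
    # so other jobs may be packed onto the remaining cores
    qsub -l nodes=1 worker-launch.sh

    # what I'd like it to request: one node with all 8 cores
    qsub -l nodes=1:ppn=8 worker-launch.sh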
I see that this has been discussed as far back as 2007, and I found Marcin and Mike's previous discussion of the topic at

    http://mail.ci.uchicago.edu/pipermail/swift-user/2010-March/001409.html

but there didn't seem to be any definitive conclusion. Any suggestions would be appreciated!

Matthew

From aespinosa at cs.uchicago.edu Wed Nov 3 18:41:00 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 3 Nov 2010 18:41:00 -0500
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
In-Reply-To:
References:
Message-ID:

Hi Matthew,

Does this mean coasters will now submit nodes=1:ppn=1 and do node packing?

If there is no node packing being initiated by PBS, you can just specify workersPerNode=8. But then what you request from PBS is different from what you actually use.

-Allan

2010/11/3 Matthew Woitaszek :
> Good afternoon,
>
> Is there a way to update PBS resource requests when using coasters to supply
> modified PBS resource strings such as "nodes=1:ppn=8"? (Or other arbitrary
> resource requests, such as node properties?)
>
> Of course, I'm just trying to get coasters to allocate all of the processors
> on an 8-core node, using either the "gt2:gt2:pbs" or "local:pbs" provider.
> Both submit jobs just fine. I found no discernible difference with the
> "host_types" Globus namespace variable, presuming I'm setting it right.
>
> The particular cluster I'm using allows node packing for users that run lots
> of single-processor tasks, so without ppn, it will assume nodes=1,ncpus=1
> and thus pack 8 jobs on each node before moving on to the next node. (I know
> it won't be an issue at sites that make nodes exclusive. On this system, the
> queue default is "nodes=1:ppn=8", but because coasters explicitly specifies
> the number of nodes in its generated resource request, the ppn default seems
> to get lost!)
>
> I see that this has been discussed as far back as 2007, and I found Marcin
> and Mike's previous discussion of the topic at
>
> http://mail.ci.uchicago.edu/pipermail/swift-user/2010-March/001409.html
>
> but there didn't seem to be any definitive conclusion. Any suggestions would
> be appreciated!
>
> Matthew

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

From wilde at mcs.anl.gov Thu Nov 4 10:06:58 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 4 Nov 2010 10:06:58 -0500 (CDT)
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
In-Reply-To:
Message-ID: <234626292.12872.1288883218697.JavaMail.root@zimbra.anl.gov>

[long response follows, sorry - I tried to condense, but this hits messy issues]

Hi Matthew,

Your question hits issues that we need to resolve and do more testing on. Most common modes seem to be working, but I have been worried that some bugs remain, and it's possible - but not 100% clear - that we'll need more attribute-setting control. I think that some node-packing issues Marcin encountered on the Argonne PBS Fusion cluster went unresolved.

Specifically, I've been suspicious that with automated (default) coaster operation, there may be cases with PBS and SGE where we either get too few (1 instead of N) or too many (N^2 instead of N) jobs running per node.

I'll try to cover these by answering your questions, below.

----- Original Message -----
> Good afternoon,
>
> Is there a way to update PBS resource requests when using coasters to
> supply modified PBS resource strings such as "nodes=1:ppn=8"?
> (Or other arbitrary resource requests, such as node properties?)

Not that I know of. You can set the number of cores that should be used on each node using the coasters pool attribute "workersPerNode". But see the issues in the summary below.

You can also start coaster workers manually, in which case you can set any scheduler attributes explicitly. We have a growing set of scripts that enable this, but they're not ready for release yet. We hope to integrate this option into the evolving swiftconfig/swiftrun tools that you may have seen discussed on the list, which are in the trunk but not yet documented in the users guide. Let's discuss this possibility in a separate thread if, after reading this, you feel you need it.

> Of course, I'm just trying to get coasters to allocate all of the
> processors on an 8-core node, using either the "gt2:gt2:pbs" or
> "local:pbs" provider. Both submit jobs just fine. I found no
> discernible difference with the "host_types" Globus namespace
> variable, presuming I'm setting it right.

Did you try just setting workersPerNode (in the Globus profile) to 8? This should be working with coasters on PBS and gt2:gt2:pbs, and I'm pretty sure it is working on TACC Ranger (an SGE machine with N=16). Note that this attribute is in the "Globus" profile set, but that's a misnomer - many attributes in that profile affect coasters and the local providers and are unrelated to Globus operation per se.

> The particular cluster I'm using allows node packing for users that
> run lots of single-processor tasks, so without ppn, it will assume
> nodes=1,ncpus=1 and thus pack 8 jobs on each node before moving on to
> the next node. (I know it won't be an issue at sites that make nodes
> exclusive. On this system, the queue default is "nodes=1:ppn=8", but
> because coasters explicitly specifies the number of nodes in its
> generated resource request, the ppn default seems to get lost!)

You can set "debug=true" in etc/provider-pbs.properties, and then Swift will retain the submit file in $HOME/.globus/scripts, so you can verify the scheduler directives that Swift is setting.

> I see that this has been discussed as far back as 2007, and I found
> Marcin and Mike's previous discussion of the topic at
>
> http://mail.ci.uchicago.edu/pipermail/swift-user/2010-March/001409.html

Right - that issue is still unresolved. I'll try to push it forward. Can you help us determine the right conventions and then verify that they work for you?

The issue, I think, is that the user either needs to know whether or not the scheduler does node packing, or needs a way to specify job attributes that makes such knowledge unnecessary. What I think we need is:

- a set of attributes that forces the scheduler to allocate complete nodes and gives the user control over how many jobs to run per node
- a set of attributes that assumes the scheduler *will* pack nodes, and that does the right thing in that case
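To make that concrete, here is roughly how the two setups that work today look in sites.xml - a sketch using only the existing workersPerNode attribute, with the rest of the pool definition omitted:

    <!-- scheduler packs nodes: request single cores, one worker each -->
    <profile namespace="globus" key="workersPerNode">1</profile>

    <!-- scheduler hands out whole 8-core nodes: run 8 workers on each -->
    <profile namespace="globus" key="workersPerNode">8</profile>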
In summary, I think the current situation is this:

- when coasters submits a 1-node job:
  -- workersPerNode=1
     o works fine if the scheduler packs nodes
     o uses only 1 core if the scheduler does not pack nodes
  -- workersPerNode=N
     o runs up to N^2 tasks if the scheduler packs nodes
     o works fine if the scheduler does not pack nodes
- when coasters submits an N-node job, N>1:
  -- workersPerNode=1
     o works fine if the scheduler packs nodes
     o uses only 1 core per node if the scheduler does not pack nodes
  -- workersPerNode=N
     o runs up to N^2 tasks if the scheduler packs nodes
     o works fine if the scheduler does not pack nodes

Based on the above cases, it seems that "all is fine" as long as the sites description is set according to whether the scheduler will node-pack or not, and workersPerNode is set correctly. Typically, you want to run either in 1-core packing mode or in N-core full-node-allocation mode.

I *think* that some schedulers (PBS, maybe SGE) may determine packing behavior based on the queue or, in the case of SGE, perhaps by the parallel environment (PE). For now, the user must know how to match the sites.xml spec to the behavior of the target cluster. We're trying to work out suggested specs for all the clusters in the Argonne/UChicago/TeraGrid environment, and most of Open Science Grid as well.

I am worried that there remains an issue in what we call "multi-node" operation. When a coaster job ("slot") uses more than one node, Swift itself needs to start the coaster agent, worker.pl, on each node in the job. This is done with explicit shell code that Swift places in the submit file.

I have argued in the past that we simply need one more attribute (I called it coresPerNode) which tells Swift exactly how many cores per node to request from the scheduler. The two typical values would be 1 (if the user wants to use node packing) and N, where N is the actual number of cores per node, when the user wants to allocate entire nodes. I think this may be needed for SGE but possibly not PBS. I'm pretty sure we have cases in SGE local-scheduler coaster mode where the provider needs this in order to formulate a submit file that SGE will accept. Mihael did not agree that this was necessary, and we never resolved the issue.

So what we need to discuss, test, and resolve is:

- is the coaster provider correctly handling both "node-packed" and "full-node" mode?
- is it handling these modes correctly with both the local-scheduler parent provider and with GT2?
- with SGE, are we working correctly with all or most parallel environments and job-launching programs? How do we know, for a given SGE deployment? Are we starting multi-node jobs correctly on all schedulers in all modes?
- do we have a sufficient way to set scheduler attributes?
- we need to automate the testing of the many mode combinations that are likely to be used

If you are willing to help define (or even develop) and test improvements, we'd welcome your assistance.

Sorry for the long response. I think with more analysis we can simplify the issue.

Regards,

Mike

> but there didn't seem to be any definitive conclusion. Any suggestions
> would be appreciated!
> > Matthew > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From matthew.woitaszek at gmail.com Thu Nov 4 10:06:48 2010 From: matthew.woitaszek at gmail.com (Matthew Woitaszek) Date: Thu, 4 Nov 2010 09:06:48 -0600 Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn In-Reply-To: References: Message-ID: Hi Allan, Yep, that's it. When the coasters resource request comes in with just "nodes=1", it gets interpreted by PBS as nodes=1:ppn=1, and thus PBS puts other jobs on the node, too, until all 8 CPUs are allocated (e.g., 8 1-cpu PBS jobs are running on it). I'd like to find some way to make the request as: nodes=1:ppn=8 along with workersPerNode=8 so that PBS allocates one node and all 8 processors, and then one Coasters job would put 8 workers on it, matching the resource request with the use. Matthew On Wed, Nov 3, 2010 at 5:41 PM, Allan Espinosa wrote: > Hi Matthew, > > Does this mean, coasters will now submit nodes=1;ppn=1 and do node packing? > > If there is no node packing being initiated by PBS, you can just > specify workersPerNode=8 . But then what you request to PBS is now > different to what you actually use. > > -Allan > > 2010/11/3 Matthew Woitaszek : > > Good afternoon, > > > > Is there a way to update PBS resource requests when using coasters to > supply > > modified PBS resource strings such as "nodes=1:ppn=8"? (Or other > arbitrary > > resource requests, such as node properties?) > > > > Of course, I'm just trying to get coasters to allocate all of the > processors > > on an 8-core node, using either the "gt2:gt2:pbs" or "local:pbs" > provider. > > Both submit jobs just fine. I found no discernible difference with the > > "host_types" Globus namespace variable, presuming I'm setting it right. > > > > The particular cluster I'm using allows node packing for users that run > lots > > of single-processor tasks, so without ppn, it will assume nodes=1,ncpus=1 > > and thus pack 8 jobs on each node before moving on to the next node. (I > know > > it won't be an issue at sites that make nodes exclusive. On this system, > the > > queue default is "nodes=1:ppn=8", but because coasters explicitly > specifies > > the number of nodes in its generated resource request, the ppn default > seems > > to get lost!) > > > > I see that this has been discussed as far back as 2007, and I found > Marcin > > and Mike's previous discussion of the topic at > > > > > http://mail.ci.uchicago.edu/pipermail/swift-user/2010-March/001409.html > > > > but there didn't seem to be any definitive conclusion. Any suggestions > would > > be appreciated! > > > > Matthew > > > > -- > Allan M. Espinosa > PhD student, Computer Science > University of Chicago > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Nov 4 10:20:12 2010 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 4 Nov 2010 10:20:12 -0500 (CDT) Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn In-Reply-To: Message-ID: <634679765.13000.1288884012023.JavaMail.root@zimbra.anl.gov> Let me see if Mihael or I can add the ppn spec as a simple test to experiment with. I'll look for my coresPerNode mod which I think did this. - Mike ----- Original Message ----- Hi Allan, Yep, that's it. 
When the coasters resource request comes in with just "nodes=1", it gets interpreted by PBS as nodes=1:ppn=1, and thus PBS puts other jobs on the node, too, until all 8 CPUs are allocated (e.g., 8 1-cpu PBS jobs are running on it). I'd like to find some way to make the request as: nodes=1:ppn=8 along with workersPerNode=8 so that PBS allocates one node and all 8 processors, and then one Coasters job would put 8 workers on it, matching the resource request with the use. Matthew On Wed, Nov 3, 2010 at 5:41 PM, Allan Espinosa < aespinosa at cs.uchicago.edu > wrote: Hi Matthew, Does this mean, coasters will now submit nodes=1;ppn=1 and do node packing? If there is no node packing being initiated by PBS, you can just specify workersPerNode=8 . But then what you request to PBS is now different to what you actually use. -Allan 2010/11/3 Matthew Woitaszek < matthew.woitaszek at gmail.com >: > Good afternoon, > > Is there a way to update PBS resource requests when using coasters to supply > modified PBS resource strings such as "nodes=1:ppn=8"? (Or other arbitrary > resource requests, such as node properties?) > > Of course, I'm just trying to get coasters to allocate all of the processors > on an 8-core node, using either the "gt2:gt2:pbs" or "local:pbs" provider. > Both submit jobs just fine. I found no discernible difference with the > "host_types" Globus namespace variable, presuming I'm setting it right. > > The particular cluster I'm using allows node packing for users that run lots > of single-processor tasks, so without ppn, it will assume nodes=1,ncpus=1 > and thus pack 8 jobs on each node before moving on to the next node. (I know > it won't be an issue at sites that make nodes exclusive. On this system, the > queue default is "nodes=1:ppn=8", but because coasters explicitly specifies > the number of nodes in its generated resource request, the ppn default seems > to get lost!) > > I see that this has been discussed as far back as 2007, and I found Marcin > and Mike's previous discussion of the topic at > > http://mail.ci.uchicago.edu/pipermail/swift-user/2010-March/001409.html > > but there didn't seem to be any definitive conclusion. Any suggestions would > be appreciated! > > Matthew > -- Allan M. Espinosa < http://amespinosa.wordpress.com > PhD student, Computer Science University of Chicago < http://people.cs.uchicago.edu/~aespinosa > _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Nov 4 15:41:51 2010 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 04 Nov 2010 13:41:51 -0700 Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn In-Reply-To: <634679765.13000.1288884012023.JavaMail.root@zimbra.anl.gov> References: <634679765.13000.1288884012023.JavaMail.root@zimbra.anl.gov> Message-ID: <1288903311.3175.1.camel@blabla2.none> You can add the ppn profile in sites.xml: 8 This works in trunk and might work in the stable branch. Mihael On Thu, 2010-11-04 at 10:20 -0500, Michael Wilde wrote: > Let me see if Mihael or I can add the ppn spec as a simple test to > experiment with. I'll look for my coresPerNode mod which I think did > this. 
> > > - Mike > > > ______________________________________________________________________ > Hi Allan, > > Yep, that's it. When the coasters resource request comes in > with just "nodes=1", it gets interpreted by PBS as > nodes=1:ppn=1, and thus PBS puts other jobs on the node, too, > until all 8 CPUs are allocated (e.g., 8 1-cpu PBS jobs are > running on it). > > I'd like to find some way to make the request as: > nodes=1:ppn=8 > along with > workersPerNode=8 > so that PBS allocates one node and all 8 processors, and then > one Coasters job would put 8 workers on it, matching the > resource request with the use. > > Matthew > > > > > On Wed, Nov 3, 2010 at 5:41 PM, Allan Espinosa > wrote: > Hi Matthew, > > Does this mean, coasters will now submit nodes=1;ppn=1 > and do node packing? > > If there is no node packing being initiated by PBS, > you can just > specify workersPerNode=8 . But then what you request > to PBS is now > different to what you actually use. > > -Allan > > 2010/11/3 Matthew Woitaszek > : > > > Good afternoon, > > > > Is there a way to update PBS resource requests when > using coasters to supply > > modified PBS resource strings such as > "nodes=1:ppn=8"? (Or other arbitrary > > resource requests, such as node properties?) > > > > Of course, I'm just trying to get coasters to > allocate all of the processors > > on an 8-core node, using either the "gt2:gt2:pbs" or > "local:pbs" provider. > > Both submit jobs just fine. I found no discernible > difference with the > > "host_types" Globus namespace variable, presuming > I'm setting it right. > > > > The particular cluster I'm using allows node packing > for users that run lots > > of single-processor tasks, so without ppn, it will > assume nodes=1,ncpus=1 > > and thus pack 8 jobs on each node before moving on > to the next node. (I know > > it won't be an issue at sites that make nodes > exclusive. On this system, the > > queue default is "nodes=1:ppn=8", but because > coasters explicitly specifies > > the number of nodes in its generated resource > request, the ppn default seems > > to get lost!) > > > > I see that this has been discussed as far back as > 2007, and I found Marcin > > and Mike's previous discussion of the topic at > > > > > http://mail.ci.uchicago.edu/pipermail/swift-user/2010-March/001409.html > > > > but there didn't seem to be any definitive > conclusion. Any suggestions would > > be appreciated! > > > > Matthew > > > > > -- > Allan M. Espinosa > PhD student, Computer Science > University of Chicago > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user From matthew.woitaszek at gmail.com Thu Nov 4 16:35:08 2010 From: matthew.woitaszek at gmail.com (Matthew Woitaszek) Date: Thu, 4 Nov 2010 15:35:08 -0600 Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn In-Reply-To: <1288903311.3175.1.camel@blabla2.none> References: <634679765.13000.1288884012023.JavaMail.root@zimbra.anl.gov> <1288903311.3175.1.camel@blabla2.none> Message-ID: Hi Mihael, Unfortunately, I can't seem to get that to work... I just did a svn update on the trunk. 
With that line in sites.xml, I still get the nodes=1 in the PBS file in ~/.globus/scripts using the local provider; the net result also seems the same with the gt2 provider. Do you have an example that I can try, to make sure I'm not botching it up?

Matthew

On Thu, Nov 4, 2010 at 2:41 PM, Mihael Hategan wrote:
> You can add the ppn profile in sites.xml:
>
>     <profile namespace="globus" key="ppn">8</profile>
>
> This works in trunk and might work in the stable branch.
>
> Mihael

From wilde at mcs.anl.gov Thu Nov 4 17:26:47 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 4 Nov 2010 17:26:47 -0500 (CDT)
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
In-Reply-To:
Message-ID: <341420523.17327.1288909607801.JavaMail.root@zimbra.anl.gov>

I'll try later as well. Mihael, I'm wondering (from what I saw when I did the earlier experiment with a coresPerNode variable) if we need to add the line in BlockTask.java to copy the attribute from the coaster jobspec to the block's jobspec?

- Mike

----- Original Message -----

Hi Mihael,

Unfortunately, I can't seem to get that to work... I just did a svn update on the trunk. With that line in sites.xml, I still get the nodes=1 in the PBS file in ~/.globus/scripts using the local provider; the net result also seems the same with the gt2 provider. Do you have an example that I can try, to make sure I'm not botching it up?

Matthew

On Thu, Nov 4, 2010 at 2:41 PM, Mihael Hategan < hategan at mcs.anl.gov > wrote:

You can add the ppn profile in sites.xml:

    <profile namespace="globus" key="ppn">8</profile>

This works in trunk and might work in the stable branch.

Mihael

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Thu Nov 4 17:35:18 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 04 Nov 2010 15:35:18 -0700
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
In-Reply-To: <341420523.17327.1288909607801.JavaMail.root@zimbra.anl.gov>
References: <341420523.17327.1288909607801.JavaMail.root@zimbra.anl.gov>
Message-ID: <1288910118.3775.0.camel@blabla2.none>

On Thu, 2010-11-04 at 17:26 -0500, Michael Wilde wrote:
> I'll try later as well. Mihael, I'm wondering (from what I saw when I
> did the earlier experiment with a coresPerNode variable) if we need
> to add the line in BlockTask.java to copy the attribute from the
> coaster jobspec to the block's jobspec?

It should be copied. Maybe it's a bug. I'll have to try and see.

Mihael

From matthew.woitaszek at gmail.com Thu Nov 4 18:14:32 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Thu, 4 Nov 2010 17:14:32 -0600
Subject: [Swift-user] Coasters local:pbs and the coasters.log file debug output
Message-ID:

When running with coasters using the gt2:gt2:pbs provider, an excellent log file pops up in

    ~/.globus/coasters/coasters.log

containing useful lines like:

    DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:...

This file isn't generated when using the local:pbs provider, at least out of the box.

Is there a way to turn that output log file back on, or to get those debug lines in the Swift job log file, using the local:pbs provider?

Thanks for your time,

Matthew
From wilde at mcs.anl.gov Thu Nov 4 22:47:52 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 4 Nov 2010 22:47:52 -0500 (CDT)
Subject: [Swift-user] Coasters local:pbs and the coasters.log file debug output
In-Reply-To:
Message-ID: <601384125.18037.1288928872763.JavaMail.root@zimbra.anl.gov>

Matthew,

When the first token in the coaster jobmanager parameter is "local", as in "local:pbs", the coaster server threads run inside the main swift JVM, and thus log to the main swift log. That's the file with the long name starting with your swift script file name, followed by a timestamp and unique ID, and ending in .log.

All the log entries that you'd find in the coasters.log file should, I think, be in the main log.

- Mike

----- Original Message -----

When running with coasters using the gt2:gt2:pbs provider, an excellent log file pops up in ~/.globus/coasters/coasters.log containing useful lines like:

    DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:...

This file isn't generated when using the local:pbs provider, at least out of the box.

Is there a way to turn that output log file back on, or to get those debug lines in the Swift job log file, using the local:pbs provider?

Thanks for your time,

Matthew

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From matthew.woitaszek at gmail.com Fri Nov 5 14:54:49 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Fri, 5 Nov 2010 13:54:49 -0600
Subject: [Swift-user] Coasters local:pbs and the coasters.log file debug output
In-Reply-To: <601384125.18037.1288928872763.JavaMail.root@zimbra.anl.gov>
References: <601384125.18037.1288928872763.JavaMail.root@zimbra.anl.gov>
Message-ID:

Hi Mike,

I do see some of the Coasters messages in the file, such as the INFO Cpu pull messages.

With apologies for a dumb question: Can you point me to how to turn on the DEBUG TaskImpl messages?

Matthew

On Thu, Nov 4, 2010 at 9:47 PM, Michael Wilde wrote:
> Matthew,
>
> When the first token in the coaster jobmanager parameter is "local", as in
> "local:pbs", the coaster server threads run inside the main swift JVM,
> and thus log to the main swift log. That's the file with the long name
> starting with your swift script file name, followed by a timestamp and
> unique ID, and ending in .log.
>
> All the log entries that you'd find in the coasters.log file should, I
> think, be in the main log.
>
> - Mike
>
> ------------------------------
>
> When running with coasters using the gt2:gt2:pbs provider, an excellent log
> file pops up in
> ~/.globus/coasters/coasters.log
> containing useful lines like:
> DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:...
>
> This file isn't generated when using the local:pbs provider, at least
> out of the box.
>
> Is there a way to turn that output log file back on, or to get those debug
> lines in the Swift job log file, using the local:pbs provider?
>
> Thanks for your time,
>
> Matthew

> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory

From wilde at mcs.anl.gov Fri Nov 5 15:28:16 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 5 Nov 2010 15:28:16 -0500 (CDT)
Subject: [Swift-user] Provider staging vs coaster data provider
In-Reply-To: <989023705.21562.1288986250337.JavaMail.root@zimbra.anl.gov>
Message-ID: <1952703623.21905.1288988896236.JavaMail.root@zimbra.anl.gov>

Mihael, you may have explained this to me already, but can you clarify:

In addition to provider staging using the coaster provider, it seems you can say (or ) which uses the coaster channel and agent to move data?

One difference is that with provider staging, *all* file transfer for all sites is done via that method, correct?

If a coaster filesystem provider is available, then the user can use that on selected sites, while other sites can use any other provider, correct?

Provider staging with the coaster provider is done via the coaster data channel, and all data goes directly to a job directory typically placed on the worker node local filesystem, right?

Now, with coaster data provider staging, does that same restriction apply, or can the coaster data provider (assuming it really exists) place data on any path accessible to the worker? I.e., one could use a standard shared work directory if one were so inclined, although that would be counter-productive.

Once I have all the above straight, I'm going to try to figure out how these two methods interact with CDM.

Thanks,

Mike

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From wilde at mcs.anl.gov Fri Nov 5 17:53:25 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 5 Nov 2010 17:53:25 -0500 (CDT)
Subject: [Swift-user] Coasters local:pbs and the coasters.log file debug output
In-Reply-To:
Message-ID: <483086198.22863.1288997605521.JavaMail.root@zimbra.anl.gov>

Matthew, that's actually a very *good* question, and I don't know the answer (but we should document it).

I think you want to set a log4j property in the Swift file etc/log4j.properties. Hopefully the message below, from the swift-user list, provides a clue; I'm not sure whether this is the exact setting you need or just something similar. I think that to find the precise answer you'd need to look for the message you want in the source code under provider-coaster, and then set the corresponding log4j property to debug.

See also Justin's posts on simplified logging, which may interact with this. Mihael can provide the exact answer; Justin is offline at the moment.

Mike

----- Forwarded Message -----
From: "Justin M Wozniak"
To: "Allan Espinosa"
Cc: "Swift-User"
Sent: Wednesday, October 20, 2010 1:23:41 PM
Subject: Re: [Swift-user] log4j settings of vdl:* elements

Try

    log4j.logger.swift=DEBUG

See org.griphyn.vdl.karajan.lib.Log for more info.

On Wed, 20 Oct 2010, Allan Espinosa wrote:
> Hi,
>
> I think I have asked this before, but can't find the previous posts about it.
>
> I would like to set the vdl:execute2 log level to DEBUG. Which
> package/class path should I adjust in log4j.properties?
>
> Thanks,
> -Allan

----- Original Message -----

Hi Mike,

I do see some of the Coasters messages in the file, such as the INFO Cpu pull messages.

With apologies for a dumb question: Can you point me to how to turn on the DEBUG TaskImpl messages?

Matthew

On Thu, Nov 4, 2010 at 9:47 PM, Michael Wilde < wilde at mcs.anl.gov > wrote:

Matthew,

When the first token in the coaster jobmanager parameter is "local", as in "local:pbs", the coaster server threads run inside the main swift JVM, and thus log to the main swift log. That's the file with the long name starting with your swift script file name, followed by a timestamp and unique ID, and ending in .log.

All the log entries that you'd find in the coasters.log file should, I think, be in the main log.

- Mike

When running with coasters using the gt2:gt2:pbs provider, an excellent log file pops up in ~/.globus/coasters/coasters.log containing useful lines like:

    DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:...

This file isn't generated when using the local:pbs provider, at least out of the box.

Is there a way to turn that output log file back on, or to get those debug lines in the Swift job log file, using the local:pbs provider?

Thanks for your time,

Matthew

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

From hategan at mcs.anl.gov Sun Nov 7 13:57:55 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 07 Nov 2010 11:57:55 -0800
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
In-Reply-To: <1288910118.3775.0.camel@blabla2.none>
References: <341420523.17327.1288909607801.JavaMail.root@zimbra.anl.gov> <1288910118.3775.0.camel@blabla2.none>
Message-ID: <1289159875.29800.1.camel@blabla2.none>

On Thu, 2010-11-04 at 15:35 -0700, Mihael Hategan wrote:
> On Thu, 2010-11-04 at 17:26 -0500, Michael Wilde wrote:
> > I'll try later as well. Mihael, I'm wondering (from what I saw when I
> > did the earlier experiment with a coresPerNode variable) if we need
> > to add the line in BlockTask.java to copy the attribute from the
> > coaster jobspec to the block's jobspec?
>
> It should be copied. Maybe it's a bug. I'll have to try and see.

It wasn't copied, and it wasn't a bug, but my misunderstanding.

Attributes are not directly copied, since there is no one-to-one mapping between jobs and coaster blocks. So theoretically some "merge" operation needs to exist.

I added "ppn" as one of the attributes that is copied from the first job, so the scenario I mentioned should now work.

This is cog r2927/trunk.
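So against trunk, asking Torque for whole 8-core nodes should now just be a matter of adding the profile from my earlier message to the pool entry (a sketch):

    <profile namespace="globus" key="ppn">8</profile>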
Mihael

From hategan at mcs.anl.gov Sun Nov 7 14:03:52 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 07 Nov 2010 12:03:52 -0800
Subject: [Swift-user] Re: Provider staging vs coaster data provider
In-Reply-To: <1952703623.21905.1288988896236.JavaMail.root@zimbra.anl.gov>
References: <1952703623.21905.1288988896236.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289160232.29800.7.camel@blabla2.none>

On Fri, 2010-11-05 at 15:28 -0500, Michael Wilde wrote:
> Mihael, you may have explained this to me already, but can you clarify:
>
> In addition to provider staging using the coaster provider, it seems
> you can say (or ) which uses
> the coaster channel and agent to move data?

Yes.

> One difference is that with provider staging, *all* file transfer for
> all sites is done via that method, correct?

And without provider staging, all file transfer for all sites is done without provider staging (i.e., the converse is also true). This is a consequence of vdl-int getting too messy to have both in the same run, but it's not a theoretical impossibility.

> If a coaster filesystem provider is available, then the user can use
> that on selected sites, while other sites can use any other provider,
> correct?

That is correct.

> Provider staging with the coaster provider is done via the coaster
> data channel and all data goes directly to a job directory typically
> placed on the worker node local filesystem, right?

Also correct.

> Now, with coaster data provider staging, does that same restriction
> apply, or can the coaster data provider (assuming it really exists)
> place data on any path accessible to the worker? I.e., one could use a
> standard shared workdirectory if one was so inclined, although that
> would be counter-productive.

The coaster data provider works like any other data provider, and its usage is consistent with Swift's traditional way of working. In other words, files are copied to a shared directory, cached there, and the shared directory must be accessible to the worker node. Though which one of the methods is "restrictive" I don't know.

Mihael

From hategan at mcs.anl.gov Sun Nov 7 14:08:57 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 07 Nov 2010 12:08:57 -0800
Subject: [Swift-user] Coasters local:pbs and the coasters.log file debug output
In-Reply-To:
References: <601384125.18037.1288928872763.JavaMail.root@zimbra.anl.gov>
Message-ID: <1289160537.29800.12.camel@blabla2.none>

On Fri, 2010-11-05 at 13:54 -0600, Matthew Woitaszek wrote:
> Hi Mike,
>
> I do see some of the Coasters messages in the file, such as the INFO
> Cpu pull messages.
>
> With apologies for a dumb question: Can you point me to how to turn on
> the DEBUG TaskImpl messages?

You would say something like:

    log4j.logger.org.globus.cog.abstraction=DEBUG

in log4j.properties.

Which log4j.properties that is depends on how you run this. In one of the local modes (i.e., coaster service in the same JVM as Swift = coaster messages in Swift logs), you would edit swift-dist/etc/log4j.properties.

In the remote service case, and assuming you are not using a persistent coaster service, you would have to edit cog/modules/provider-coaster/resources/log4j.properties and re-compile Swift. I don't much like this, but it is like that for now.
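For example, the local-mode case is a one-line edit (a sketch; the logger name is the package prefix given above):

    # in swift-dist/etc/log4j.properties (coaster service inside the Swift JVM)
    log4j.logger.org.globus.cog.abstraction=DEBUG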
Mihael

From matthew.woitaszek at gmail.com Sun Nov 7 17:57:05 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Sun, 7 Nov 2010 16:57:05 -0700
Subject: [Swift-user] Coasters local:pbs and the coasters.log file debug output
In-Reply-To: <1289160537.29800.12.camel@blabla2.none>
References: <601384125.18037.1288928872763.JavaMail.root@zimbra.anl.gov> <1289160537.29800.12.camel@blabla2.none>
Message-ID:

Hi Mihael,

Thanks so much for the logging pointers! I have to confess that I'm really inexperienced with the log4j/Java logging facility, so those details were exactly what I needed.

I tracked down the Coasters configuration in cog/modules/provider-coaster/resources/log4j.properties, and the two lines that control what I was interested in were painfully obvious:

    log4j.logger.org.globus.cog.abstraction.coaster=INFO
    log4j.logger.org.globus.cog.abstraction.impl.common.task.TaskImpl=DEBUG

After copying those over to swift-dist/etc/log4j.properties, the Submitting|Submitted|Active messages I'm interested in get included when running Swift with Coasters in the "local" (one JVM) mode.

Thanks again for your help!

Matthew

On Sun, Nov 7, 2010 at 1:08 PM, Mihael Hategan wrote:
> On Fri, 2010-11-05 at 13:54 -0600, Matthew Woitaszek wrote:
> > Hi Mike,
> >
> > I do see some of the Coasters messages in the file, such as the INFO
> > Cpu pull messages.
> >
> > With apologies for a dumb question: Can you point me to how to turn on
> > the DEBUG TaskImpl messages?
>
> You would say something like:
> log4j.logger.org.globus.cog.abstraction=DEBUG
> in log4j.properties.
>
> Which log4j.properties that is depends on how you run this. In one of
> the local modes (i.e., coaster service in the same JVM as Swift = coaster
> messages in Swift logs), you would edit swift-dist/etc/log4j.properties.
>
> In the remote service case, and assuming you are not using a persistent
> coaster service, you would have to edit
> cog/modules/provider-coaster/resources/log4j.properties and re-compile
> Swift. I don't much like this, but it is like that for now.
>
> Mihael

From matthew.woitaszek at gmail.com Mon Nov 8 11:34:03 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Mon, 8 Nov 2010 10:34:03 -0700
Subject: [Swift-user] Coasters and PBS resource requests: nodes and ppn
In-Reply-To: <1289159875.29800.1.camel@blabla2.none>
References: <341420523.17327.1288909607801.JavaMail.root@zimbra.anl.gov> <1288910118.3775.0.camel@blabla2.none> <1289159875.29800.1.camel@blabla2.none>
Message-ID:

Mihael,

I confirm that the "ppn" attribute now gets passed through to PBS, which can be used to force Torque-based clusters using the local:pbs provider to allocate the entire node. I tested with 1 and 2 nodes. This is exactly what I was hoping for -- thank you very much.

* * *

One observational note: At least on my Torque-scheduled cluster, using -l nodes=1:ppn=8 puts 8 copies of the hostname in the PBS_NODEFILE. Since the Coasters multi-node PBS script does a simple cat/loop/ssh over PBS_NODEFILE, when using ppn > 1, ppn copies of the Perl script get run on each node. Thus, it's important that workersPerNode be set to 1:

    <profile namespace="globus" key="workersPerNode">1</profile>

This works fine for me. I'll defer to the broader discussion of nodes, workers per node, and the variables that make things work... regarding whether something like NODES=`cat $PBS_NODEFILE | sort | uniq` would be preferred, to run just one script per node with worker count control returned to workersPerNode...
    NODES=`cat $PBS_NODEFILE`    [could be edited to enforce only one entry per physical node]
    ...
    for NODE in $NODES; do
        ...
        ssh $NODE /bin/bash -c ...

Matthew

On Sun, Nov 7, 2010 at 12:57 PM, Mihael Hategan wrote:
>
> Attributes are not directly copied since there is no one-to-one mapping
> between jobs and coaster blocks. So theoretically some "merge" operation
> needs to exist.
>
> I added "ppn" as one of the attributes that is copied from the first
> job, so the scenario I mentioned should now work.
>
> This is cog r2927/trunk.
>
> Mihael

From aespinosa at cs.uchicago.edu Mon Nov 8 20:50:54 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Mon, 8 Nov 2010 20:50:54 -0600
Subject: [Swift-user] throttle transfers and vdl:stagein graphs
Message-ID:

Hi,

In my workflow, I use the default throttle.transfers=4. But my dostagein-total plot indicates that there are 72 stage-in events going on for around 90 seconds. Shouldn't there be a linear ramp-up, or a saw-tooth pattern at the plateau, because of the throttled transfers? Or am I looking at the wrong setting for this behavior?

Thanks!
-Allan

--
Allan M. Espinosa
PhD student, Computer Science
University of Chicago

[Attachment: dostagein-total.png (image/png, 3488 bytes)]

From hategan at mcs.anl.gov Mon Nov 8 22:34:50 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Mon, 08 Nov 2010 20:34:50 -0800
Subject: [Swift-user] throttle transfers and vdl:stagein graphs
In-Reply-To:
References:
Message-ID: <1289277290.18134.12.camel@blabla2.none>

On Mon, 2010-11-08 at 20:50 -0600, Allan Espinosa wrote:
> Hi,
>
> In my workflow, I use the default throttle.transfers=4. But my
> dostagein-total plot indicates that there are 72 stage-in events going
> on for around 90 seconds. Shouldn't there be a linear ramp-up or a
> saw-tooth pattern at the plateau because of having throttled
> transfers?

Lies. And statistics.

The plot indicates that a number of instances of a certain portion of vdl-int is executing. If you look at that portion of vdl-int (i.e., between setprogress("Stage in") and setprogress("Submitting")), there are a few things happening, including directory creation. Essentially you are dealing with the following pattern:

    parallelFor(...
        a()
        throttle(4, b())
        c()
    )

The graph would show something like the parallelism in the invocation of the body of parallelFor. And it is quite possible that all a() invocations start well before any of the b() invocations start.

The only accurate way to see the effect of the throttle is to trace the b() invocations, which you can probably do by looking at the status of the file transfer tasks (by enabling the relevant logging stuff).

Mihael

From mparisien at uchicago.edu Tue Nov 9 13:50:23 2010
From: mparisien at uchicago.edu (Marc Parisien)
Date: Tue, 9 Nov 2010 13:50:23 -0600
Subject: [Swift-user] using queuing system
Message-ID:

Hi All,

I'm sorry, but I really don't know what I'm doing. This is what I want to do: make Swift use the queuing system on the IBI cluster (qsub/qstat).

I made a sites.xml file, like this: IBI has 8-core nodes, and I want to use at most 4 cores/node.
--------------------------------------------------------
<profile namespace="globus" key="workersPerNode">4</profile>
<profile namespace="globus" key="slots">4096</profile>
<profile namespace="globus" key="nodeGranularity">1</profile>
<profile namespace="karajan" key="jobThrottle">2.55</profile>
<profile namespace="karajan" key="initialScore">10000</profile>
<workdirectory>/cchome/mparis_x/swift</workdirectory>
--------------------------------------------------------

Here's the shell trace of the exec:

--------------------------------------------------------
Swift svn swift-r3649 (swift modified locally) cog-r2890 (cog modified locally)

RunID: 20101109-1328-he8pfis2
Progress:
Progress: Selecting site:3 Initializing site shared directory:1
Progress: Selecting site:2 Initializing site shared directory:1 Stage in:1
Progress: Stage in:1 Submitting:3
Progress: Submitted:3 Active:1
Progress: Active:4
Worker task failed: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Exitcode file not found 5 queue polls after the job was reported done
 at org.globus.cog.abstraction.impl.scheduler.common.Job.close(Job.java:66)
 at org.globus.cog.abstraction.impl.scheduler.common.Job.setState(Job.java:177)
 at org.globus.cog.abstraction.impl.scheduler.pbs.QueuePoller.processStdout(QueuePoller.java:126)
 at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.pollQueue(AbstractQueuePoller.java:169)
 at org.globus.cog.abstraction.impl.scheduler.common.AbstractQueuePoller.run(AbstractQueuePoller.java:82)
 at java.lang.Thread.run(Unknown Source)
Progress: Active:3 Failed but can retry:1
Progress: Stage in:1 Failed but can retry:3
Progress: Stage in:1 Active:2 Failed but can retry:1
Progress: Active:4
Progress: Active:4
Progress: Active:4
Progress: Active:4
Progress: Active:3 Checking status:1
Progress: Active:2 Checking status:1 Finished successfully:1
Progress: Checking status:1 Finished successfully:3
Progress: Checking status:1 Finished successfully:4
Final status: Finished successfully:5
Cleaning up...
Shutting down service at https://172.16.0.149:52228
Got channel MetaChannel: 1235930463[1821457857: {}] -> null[1821457857: {}]
+ Done
--------------------------------------------------------

Q1. When I log into the node that processes the job, I see that it has spawned 8 processes, but my swift script should only spawn at most 4 (because my for loop is [0:3]). Why? Because of the retries??

Q2. The worker task seems to fail, but then seems to come back on its feet (Active:4)? Active:4... no no, top tells me there are 8 processes running at the same time!

Q3. The swift run returns, and qstat shows that I don't have anything queued, but if I log into the node that processed the job, I still see active processes:

[mparis_x at compute-14-41 ~]$ ps aux | grep mparis_x
mparis_x 12754 0.0 0.0 8696 1004 ? Ss 13:29 0:00 bash /opt/gridengine/default/spool/compute-14-41/job_scripts/4026341
mparis_x 12755 0.0 0.0 39360 6368 ? S 13:29 0:00 /usr/bin/perl /cchome/mparis_x/.globus/coasters/cscript5281277440516286419.pl http://172.16.0.149:50608,http://172.18.0.149:50608,http://172.20.0.1:50608,http://172.30.0.1:50608 1109-290113-000000 /cchome/mparis_x/.globus/coasters

-> Are these going to "finish" anytime by themselves... they just seem to hang there...

Thanks for your time,
Marc.

From wilde at mcs.anl.gov Tue Nov 9 14:35:00 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Tue, 9 Nov 2010 14:35:00 -0600 (CST)
Subject: [Swift-user] using queuing system
In-Reply-To:
Message-ID: <1466816887.36711.1289334900697.JavaMail.root@zimbra.anl.gov>

Hi Marc,

The IBI cluster I think is an SGE machine, not PBS.
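If you want to double-check, SGE's qconf can list the parallel environments the cluster offers - a sketch, assuming "threaded" is the PE name on ibicluster:

    qconf -spl           # list the names of all parallel environments
    qconf -sp threaded   # show the settings of the "threaded" PE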
I had sent you previously a non-coaster-based sites entry that looked like:

    <profile namespace="globus" key="pe">threaded</profile>
    <profile namespace="karajan" key="jobThrottle">.49</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <workdirectory>$(pwd)/swiftwork</workdirectory>

so change the coaster version you posted, below, to this:

    <profile namespace="globus" key="pe">threaded</profile>
    <profile namespace="globus" key="workersPerNode">4</profile>
    <profile namespace="globus" key="slots">128</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>
    <profile namespace="globus" key="maxnodes">1</profile>
    <profile namespace="karajan" key="jobThrottle">5.11</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <workdirectory>/cchome/mparis_x/swift</workdirectory>

The changes above are:

- added the "pe" tag, needed for SGE ("parallel environment"); "threaded" seems to be the right PE for ibicluster
- changed slots to 128: submit up to 128 SGE jobs at once
- nodeGranularity 1, maxnodes 1: each job should request 1 node
- raised the throttle to allow up to 512 swift app() calls to run at once (4x128)

Then, change your tc.data file to read "sge" instead of "pbs".

Lastly, set the following in a file named "cf":

    wrapperlog.always.transfer=true
    sitedir.keep=true
    execution.retries=0
    lazy.errors=false
    status.mode=provider
    use.provider.staging=false
    provider.staging.pin.swiftfiles=false

and run swift using a command similar to this:

    swift -config cf -sites.file sites.xml -tc.file tc.data yourscript.swift -args=etc

(changing the file names to match yours)

I will need to add a config file into my "latest/" Swift release on ibicluster to retain SGE submit files and stdout/err logs. But for now, you can proceed as above, without that.

More notes below...

----- Original Message -----
> Hi All,
>
> I'm sorry, but I really don't know what I'm doing. This is what I want
> to do: make Swift use the queuing system on the IBI cluster
> (qsub/qstat).
>
> I made a sites.xml file, like this: IBI has 8-core nodes, and I want
> to use at most 4 cores/node.
>
> --------------------------------------------------------
> <profile namespace="globus" key="workersPerNode">4</profile>
> <profile namespace="globus" key="slots">4096</profile>
> <profile namespace="globus" key="nodeGranularity">1</profile>
> <profile namespace="karajan" key="jobThrottle">2.55</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <workdirectory>/cchome/mparis_x/swift</workdirectory>
> --------------------------------------------------------
>
> Here's the shell trace of the exec:
>
> --------------------------------------------------------
> Swift svn swift-r3649 (swift modified locally) cog-r2890 (cog modified
> locally)
>
> RunID: 20101109-1328-he8pfis2
> Progress:
> Progress: Selecting site:3 Initializing site shared directory:1
> Progress: Selecting site:2 Initializing site shared directory:1 Stage in:1
> Progress: Stage in:1 Submitting:3
> Progress: Submitted:3 Active:1
> Progress: Active:4
> Worker task failed:
> org.globus.cog.abstraction.impl.scheduler.common.ProcessException:
> Exitcode file not found 5 queue polls after the job was reported done
> at

I suspect this is because Swift submitted PBS-style jobs to SGE, based on the incorrect sites.xml attributes.

> at java.lang.Thread.run(Unknown Source)
> Progress: Active:3 Failed but can retry:1
> Progress: Stage in:1 Failed but can retry:3
> Progress: Stage in:1 Active:2 Failed but can retry:1
> Progress: Active:4
> Progress: Active:4
> Progress: Active:4
> Progress: Active:4
> Progress: Active:3 Checking status:1
> Progress: Active:2 Checking status:1 Finished successfully:1
> Progress: Checking status:1 Finished successfully:3
> Progress: Checking status:1 Finished successfully:4
> Final status: Finished successfully:5
> Cleaning up...
> Shutting down service at https://172.16.0.149:52228
> Got channel MetaChannel: 1235930463[1821457857: {}] ->
> null[1821457857: {}]
> + Done
> --------------------------------------------------------
>
> Q1. When I log into the node that processes the job, I see that it has
> spawned 8 processes, but my swift script should only spawn at most 4
> (because my for loop is [0:3]). Why? Because of the retries??

I'm not sure. There should be one swift worker.pl process running per node.
If the problem persists, please send a snapshot of what you see in ps, using: ps -fjH -u mparis_x

> Q2. The worker task seems to fail, but then seems to come back on its feet (Active:4)? Active:4... no, top tells me there are 8 processes running at the same time!
>
> Q3. Swift returns, and qstat shows that I don't have anything queued, but if I log into the node that handled the job, I still see active processes:
>
> [mparis_x at compute-14-41 ~]$ ps aux | grep mparis_x
> mparis_x 12754 0.0 0.0 8696 1004 ? Ss 13:29 0:00 bash /opt/gridengine/default/spool/compute-14-41/job_scripts/4026341
> mparis_x 12755 0.0 0.0 39360 6368 ? S 13:29 0:00 /usr/bin/perl /cchome/mparis_x/.globus/coasters/cscript5281277440516286419.pl http://172.16.0.149:50608,http://172.18.0.149:50608,http://172.20.0.1:50608,http://172.30.0.1:50608 1109-290113-000000 /cchome/mparis_x/.globus/coasters
>
> -> Are these going to "finish" at some point by themselves? They just seem to hang there...

I'll look at this with you if it persists once we correct the sites file.

> Thanks for your time, Marc.

-- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory

From matthew.woitaszek at gmail.com Wed Nov 10 16:41:20 2010
From: matthew.woitaszek at gmail.com (Matthew Woitaszek)
Date: Wed, 10 Nov 2010 15:41:20 -0700
Subject: [Swift-user] Coasters - idle time exceeded
Message-ID:

Good afternoon,

While running with Coasters, I occasionally get messages like this:

Idle time exceeded at /home/username/.globus/coasters/cscript....pl line 627.

Then things go horribly wrong and the processing usually doesn't complete.

At first I thought this was in cases where my workflow had a long tail and many workers were left idle as some long-running tasks finished up -- a symptom of my "let's try this 512-task workflow with 64-128 cores and see what happens!" experimentation phase. I got around it by just requesting fewer nodes from PBS in my Coasters configuration. But now it's popping up on smaller workflows. The susceptible workflows seem to be preloaded with less than one node's worth of tasks on the first round of dependencies.

Is there a way that I can increase the idle time limit? Ideally, I'd like the coasters to wait for the entire PBS job walltime.

Matthew

From wilde at mcs.anl.gov Wed Nov 10 19:47:24 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 10 Nov 2010 19:47:24 -0600 (CST)
Subject: [Swift-user] Coasters - idle time exceeded
In-Reply-To:
Message-ID: <121721591.44480.1289440044368.JavaMail.root@zimbra.anl.gov>

Hi Matthew,

Could you send your swift .log file to us at swift-devel, as well as your sites.xml file, tc.data, and swift.properties (if you have changed them)?

We'll also want to look at $HOME/.globus/coasters.log and any other coaster worker log files (from this run) that might be under .globus/coasters (although the latter is probably not there, as *I think* coasters doesn't write worker logs if there are more than some threshold of total workers).

We may need to reproduce this scenario here to debug it. Mihael may have better suggestions on how to proceed.
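Something along these lines, run on the submit host, should bundle everything up in one go; the worker-*.log pattern is my guess at the worker log names, so adjust it to whatever you actually see in that directory:

cd $HOME/.globus/coasters
ls -lt | head -20                                          # see which logs this run produced
tar czf ~/coaster-logs.tar.gz coasters.log worker-*.log    # worker-*.log: guessed pattern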
- Mike

----- Original Message -----
> Good afternoon,
>
> While running with Coasters, I occasionally get messages like this:
>
> Idle time exceeded at /home/username/.globus/coasters/cscript....pl line 627.
>
> Then things go horribly wrong and the processing usually doesn't complete.
>
> At first I thought this was in cases where my workflow had a long tail and many workers were left idle as some long-running tasks finished up -- a symptom of my "let's try this 512-task workflow with 64-128 cores and see what happens!" experimentation phase. I got around it by just requesting fewer nodes from PBS in my Coasters configuration. But now it's popping up on smaller workflows. The susceptible workflows seem to be preloaded with less than one node's worth of tasks on the first round of dependencies.
>
> Is there a way that I can increase the idle time limit? Ideally, I'd like the coasters to wait for the entire PBS job walltime.
>
> Matthew

-- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory

From hategan at mcs.anl.gov Wed Nov 10 21:06:51 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Wed, 10 Nov 2010 19:06:51 -0800
Subject: [Swift-user] Coasters - idle time exceeded
In-Reply-To:
References:
Message-ID: <1289444811.24457.20.camel@blabla2.none>

There is a way to increase that limit. That parameter also seems to be a command-line argument, though I don't see it used that way. In any event, look for "my $IDLETIMEOUT" in provider-coaster/resources/worker.pl and change the default there (4 * 60) to whatever you want (I suggest "very large number"). Then re-compile and re-run.
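In sketch form, the logic in question is roughly the following; this is a paraphrase to show where the knob sits, not the literal worker.pl source, and $lastJobTime is a made-up variable name:

# Paraphrased sketch of the worker.pl idle check -- not the actual source.
my $IDLETIMEOUT = 4 * 60;   # seconds; raise this, e.g. toward the PBS job walltime
my $lastJobTime = time();   # hypothetical name: when this worker last received work

sub checkIdle {
    my $idle = time() - $lastJobTime;
    die "Idle time exceeded" if $idle > $IDLETIMEOUT;   # the error seen above
}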
The idle time was used in a previous version of the coasters (when there was no block allocation) as a mechanism to clean up unused workers. This is now done by the coaster service itself. The problem with letting the workers do this is that they have no knowledge that they are part of a block. In said previous version, a worker dying would be seen immediately by the service through the fact that the worker job ended. This is not the case with the current block scheme, in which workers are part of multi-node jobs. The advantage of letting the workers do this is that it is algorithmically simple.

So given the above, I'd be in favor of getting rid of this idle timeout. The only remaining concern is preventing workers from running when the coaster service has died. However, the heartbeat mechanism should take care of that. Opinions?

Mihael

On Wed, 2010-11-10 at 15:41 -0700, Matthew Woitaszek wrote:
> Good afternoon,
>
> While running with Coasters, I occasionally get messages like this:
>
> Idle time exceeded at /home/username/.globus/coasters/cscript....pl line 627.
>
> Then things go horribly wrong and the processing usually doesn't complete.
>
> At first I thought this was in cases where my workflow had a long tail and many workers were left idle as some long-running tasks finished up -- a symptom of my "let's try this 512-task workflow with 64-128 cores and see what happens!" experimentation phase. I got around it by just requesting fewer nodes from PBS in my Coasters configuration. But now it's popping up on smaller workflows. The susceptible workflows seem to be preloaded with less than one node's worth of tasks on the first round of dependencies.
>
> Is there a way that I can increase the idle time limit? Ideally, I'd like the coasters to wait for the entire PBS job walltime.
>
> Matthew

From aespinosa at cs.uchicago.edu Wed Nov 10 21:08:12 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Wed, 10 Nov 2010 21:08:12 -0600
Subject: [Swift-user] concurrent_mapper filenames
Message-ID:

Blowup fblow[] <concurrent_mapper; prefix="blowup-", suffix=".dat">;

Is this the expected generated sequence?

blowup/
blowup/_concurrent
blowup/_concurrent/blowup---array
blowup/_concurrent/blowup---array/elt-10.dat
blowup/_concurrent/blowup---array/elt-12.dat
blowup/_concurrent/blowup---array/h8
blowup/_concurrent/blowup---array/h8/elt-108.dat
blowup/_concurrent/blowup---array/elt-24.dat
blowup/_concurrent/blowup---array/elt-11.dat
blowup/_concurrent/blowup---array/h1
blowup/_concurrent/blowup---array/h1/elt-101.dat
blowup/_concurrent/blowup---array/h7
blowup/_concurrent/blowup---array/h7/elt-157.dat
blowup/_concurrent/blowup---array/h7/elt-107.dat
blowup/_concurrent/blowup---array/elt-14.dat
blowup/_concurrent/blowup---array/elt-15.dat
blowup/_concurrent/blowup---array/h14
blowup/_concurrent/blowup---array/h14/elt-39.dat
blowup/_concurrent/blowup---array/h14/elt-164.dat
blowup/_concurrent/blowup---array/elt-7.dat
blowup/_concurrent/blowup---array/h18
blowup/_concurrent/blowup---array/h18/elt-93.dat
blowup/_concurrent/blowup---array/h18/elt-168.dat

-- Allan M. Espinosa, PhD student, Computer Science, University of Chicago

From mparisien at uchicago.edu Thu Nov 11 09:16:48 2010
From: mparisien at uchicago.edu (Marc Parisien)
Date: Thu, 11 Nov 2010 09:16:48 -0600
Subject: [Swift-user] runnin' on IBI
Message-ID: <9B56F526-7E0A-43FC-BF3C-A5D0B0B6F920@uchicago.edu>

Hi, has anyone managed to run stuff on the IBI cluster? I've tried twice now, and both attempts eventually failed. The error goes along the lines of:

Caused by: Exception caught while reading exit code
Caused by: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Exception caught while reading exit code
Caused by: java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Unknown Source)

which would be strange for a program to terminate without an exit code... I attached the last 300 lines of the log file, in case it helps!

Many Thanks, Marc.

-------------- next part -------------- A non-text attachment was scrubbed... Name: swift.log Type: application/octet-stream Size: 27895 bytes Desc: not available URL:

From wilde at mcs.anl.gov Thu Nov 11 13:56:03 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 11 Nov 2010 13:56:03 -0600 (CST)
Subject: [Swift-user] concurrent_mapper filenames
In-Reply-To:
Message-ID: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>

Were you expecting something like: blowup/blowup-array/elt-10.dat ?

We could try to adjust the output behavior of this mapper. I'm more interested, though, in revamping our mapper set, and rolling out a new set while leaving the old ones in place (deprecated) for a while.

- Mike

----- Original Message -----
> Blowup fblow[] <concurrent_mapper; prefix="blowup-", suffix=".dat">;
>
> Is this the expected generated sequence?
>
> blowup/
> blowup/_concurrent
> blowup/_concurrent/blowup---array
> blowup/_concurrent/blowup---array/elt-10.dat
> blowup/_concurrent/blowup---array/elt-12.dat
> blowup/_concurrent/blowup---array/h8
> blowup/_concurrent/blowup---array/h8/elt-108.dat
> blowup/_concurrent/blowup---array/elt-24.dat
> blowup/_concurrent/blowup---array/elt-11.dat
> blowup/_concurrent/blowup---array/h1
> blowup/_concurrent/blowup---array/h1/elt-101.dat
> blowup/_concurrent/blowup---array/h7
> blowup/_concurrent/blowup---array/h7/elt-157.dat
> blowup/_concurrent/blowup---array/h7/elt-107.dat
> blowup/_concurrent/blowup---array/elt-14.dat
> blowup/_concurrent/blowup---array/elt-15.dat
> blowup/_concurrent/blowup---array/h14
> blowup/_concurrent/blowup---array/h14/elt-39.dat
> blowup/_concurrent/blowup---array/h14/elt-164.dat
> blowup/_concurrent/blowup---array/elt-7.dat
> blowup/_concurrent/blowup---array/h18
> blowup/_concurrent/blowup---array/h18/elt-93.dat
> blowup/_concurrent/blowup---array/h18/elt-168.dat
>
> -- Allan M. Espinosa, PhD student, Computer Science, University of Chicago

From aespinosa at cs.uchicago.edu Thu Nov 11 14:02:11 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 11 Nov 2010 14:02:11 -0600
Subject: [Swift-user] concurrent_mapper filenames
In-Reply-To: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>
References: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>
Message-ID:

I was expecting something like blowup/blowup-XXXX.dat. But it doesn't matter, since I'm only after the output.

2010/11/11 Michael Wilde :
> Were you expecting something like:
> blowup/blowup-array/elt-10.dat
> ?
>
> We could try to adjust the output behavior of this mapper.
>
> I'm more interested, though, in revamping our mapper set, and rolling out a new set while leaving the old ones in place (deprecated) for a while.
>
> - Mike
>
> ----- Original Message -----
>> Blowup fblow[] <concurrent_mapper; prefix="blowup-", suffix=".dat">;
>>
>> Is this the expected generated sequence?
>>
>> blowup/
>> blowup/_concurrent
>> blowup/_concurrent/blowup---array
>> blowup/_concurrent/blowup---array/elt-10.dat
>> blowup/_concurrent/blowup---array/elt-12.dat
>> blowup/_concurrent/blowup---array/h8
>> blowup/_concurrent/blowup---array/h8/elt-108.dat
>> blowup/_concurrent/blowup---array/elt-24.dat
>> blowup/_concurrent/blowup---array/elt-11.dat
>> blowup/_concurrent/blowup---array/h1
>> blowup/_concurrent/blowup---array/h1/elt-101.dat
>> blowup/_concurrent/blowup---array/h7
>> blowup/_concurrent/blowup---array/h7/elt-157.dat
>> blowup/_concurrent/blowup---array/h7/elt-107.dat
>> blowup/_concurrent/blowup---array/elt-14.dat
>> blowup/_concurrent/blowup---array/elt-15.dat
>> blowup/_concurrent/blowup---array/h14
>> blowup/_concurrent/blowup---array/h14/elt-39.dat
>> blowup/_concurrent/blowup---array/h14/elt-164.dat
>> blowup/_concurrent/blowup---array/elt-7.dat
>> blowup/_concurrent/blowup---array/h18
>> blowup/_concurrent/blowup---array/h18/elt-93.dat
>> blowup/_concurrent/blowup---array/h18/elt-168.dat
>>
>> -- Allan M. Espinosa, PhD student, Computer Science, University of Chicago
From benc at hawaga.org.uk Thu Nov 11 14:05:32 2010
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 11 Nov 2010 20:05:32 +0000 (GMT)
Subject: [Swift-user] concurrent_mapper filenames
In-Reply-To:
References: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>
Message-ID:

> I was expecting something like blowup/blowup-XXXX.dat. But it doesn't matter, since I'm only after the output.

Roughly, the XXXXX number gets converted into a digit sequence in some base, and then each digit is used to make a new directory level, with the last digit being the actual file name.

The aim is to reduce the number of GPFS nodes accessing the same directory, which is/was a fairly serious scalability problem.

--

From benc at hawaga.org.uk Thu Nov 11 14:01:51 2010
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 11 Nov 2010 20:01:51 +0000 (GMT)
Subject: [Swift-user] concurrent_mapper filenames
In-Reply-To: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>
References: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>
Message-ID:

> Were you expecting something like:
> blowup/blowup-array/elt-10.dat
> ?
>
> We could try to adjust the output behavior of this mapper.

The concurrent mapper was not intended for file names that you would expect to use outside of Swift; there is already a mapper for numbering files based on e.g. array indices. It's going to generate "weird looking" filenames based on tuning I did to make it behave well on GPFS, rather than filenames that you should expect to be able to predict.

--

From aespinosa at cs.uchicago.edu Thu Nov 11 14:17:37 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 11 Nov 2010 14:17:37 -0600
Subject: [Swift-user] concurrent_mapper filenames
In-Reply-To:
References: <1642454117.48440.1289505363610.JavaMail.root@zimbra.anl.gov>
Message-ID:

Ah, that makes sense. Thanks Ben!

-Allan

2010/11/11 Ben Clifford :
>> I was expecting something like blowup/blowup-XXXX.dat. But it doesn't matter, since I'm only after the output.
>
> Roughly, the XXXXX number gets converted into a digit sequence in some base, and then each digit is used to make a new directory level, with the last digit being the actual file name.
>
> The aim is to reduce the number of GPFS nodes accessing the same directory, which is/was a fairly serious scalability problem.
>

From wilde at mcs.anl.gov Thu Nov 11 18:11:19 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 11 Nov 2010 18:11:19 -0600 (CST)
Subject: [Swift-user] runnin' on IBI
In-Reply-To: <9B56F526-7E0A-43FC-BF3C-A5D0B0B6F920@uchicago.edu>
Message-ID: <190998605.50353.1289520679078.JavaMail.root@zimbra.anl.gov>

Marc, can you point me to the directory in which you performed this run, and make sure that I can access it?

I am wondering if you changed swift.properties, either by specifying a -config file on the swift command line or by editing swift.properties in your swift build or $HOME/.swift directory? Specifically, I'm wondering what the property "status.mode" is set to. If it's set to:

status.mode=provider

then can you try again with it set to:

status.mode=file

and vice versa? It looks to me as if it's set (defaulting) to "file", but I can't tell without looking deeper into the swift code.

As I recall you are using the SGE provider in non-coaster mode, right?
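(For background, the two settings differ in how the app exit code is detected; this is my understanding of their meaning, worth verifying against the Swift user guide:)

status.mode=file       # the remote wrapper writes a per-job exit-status file, which Swift polls for
status.mode=provider   # Swift trusts the job status / exit code reported by the execution provider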
If you have not changed anything in swift.properties, but you are specifying -sites.file and -tc.file on the command line, then you should create a file (call it "cf") with these lines, in the directory in which you run the swift command:

wrapperlog.always.transfer=true
sitedir.keep=true
execution.retries=0
lazy.errors=false
status.mode=provider
use.provider.staging=false
provider.staging.pin.swiftfiles=false

and then run the swift command with the additional arg (at the front):

swift -config cf -sites.file sites.xml etc etc

This may not address the problem, but it will help us diagnose it a bit further.

- Mike

----- Original Message -----
> Hi,
>
> has anyone managed to run stuff on the IBI cluster?
>
> I've tried twice now, and both attempts eventually failed. The error goes along the lines of:
>
> Caused by: Exception caught while reading exit code
> Caused by: org.globus.cog.abstraction.impl.scheduler.common.ProcessException: Exception caught while reading exit code
> Caused by: java.lang.NumberFormatException: null
> at java.lang.Integer.parseInt(Unknown Source)
>
> which would be strange for a program to terminate without an exit code...
>
> I attached the last 300 lines of the log file, in case it helps!
>
> Many Thanks, Marc.

-- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory

From wilde at mcs.anl.gov Thu Nov 18 08:31:17 2010
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Thu, 18 Nov 2010 08:31:17 -0600 (CST)
Subject: [Swift-user] runnin' on IBI
In-Reply-To: <7B605614-990E-403A-B01B-248C913640F0@uchicago.edu>
Message-ID: <438910860.76468.1290090677740.JavaMail.root@zimbra.anl.gov>

Hi Marc,

Sorry for the delayed response. I was suspicious that perhaps the (fairly) new SGE provider was not correctly handling error return codes. But I tested it (on ibicluster) with both non-zero return codes and with apps that fail and raise a signal (I tried a divide-by-0 fault). Everything I tested worked - I was unable to cause the error you received.

Then I tried to run your modftdock swift script - that worked as well (see below). This is in ~wilde/marc if you want to examine how I ran it (see run.sh).

Then I tested with the default Java, which had earlier caused first.swift to fail. To my surprise, that worked as well.

So, the next things to do here are:
- you may want to copy my ~wilde/marc directory and see if it works for you
- I would like to get the full logs (and ideally the work directory) from the run(s) that failed for you
- if you could, please try to reproduce the error again and point me to a directory with all your files and all the files that swift produced
- also send me your $PATH value so I know what Java you used (we should log this in the Swift log if it's not there already)
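For reference, the return-code test was along these lines; the names below (fail.sh, failprog, fail.out) are made up for illustration:

#!/bin/sh
# fail.sh -- a trivial app that exits with a non-zero code
echo "failing on purpose" >&2
exit 3

# tc.data entry mapping the app name to that script (one line):
# ibicluster  failprog  /cchome/wilde/marc/fail.sh  INSTALLED  INTEL32::LINUX  null

And the SwiftScript driver:

type file;
app (file o) failprog() {
    failprog stdout=@filename(o);
}
file out <"fail.out">;
out = failprog();

Swift should then report a clean non-zero exit status for the failprog job, rather than the NumberFormatException above.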
Thanks, Mike

My output was:

[wilde at ibicluster ~]$ pwd
/cchome/wilde
[wilde at ibicluster ~]$ cd marc
[wilde at ibicluster marc]$ ./run.sh
Swift svn swift-r3702 (swift modified locally) cog-r2924 (cog modified locally)
RunID: 20101118-0819-01ot1210
Progress:
SwiftScript trace: 1a25
SwiftScript trace: 1a2z
Progress: Stage in:8
Progress: Submitting:6 Submitted:2
Progress: Submitting:3 Submitted:5
Progress: Submitted:8
Progress: Submitted:7 Active:1
Progress: Active:8
Progress: Active:8
Progress: Active:8
Progress: Active:8
Progress: Active:8
Progress: Active:7 Checking status:1
Progress: Active:4 Stage out:1 Finished successfully:3
Progress: Submitted:1 Active:4 Finished successfully:4
Progress: Active:5 Finished successfully:4
Progress: Active:4 Finished successfully:5
Progress: Active:4 Finished successfully:5
Progress: Active:3 Checking status:1 Finished successfully:5
Progress: Active:3 Finished successfully:6
Progress: Active:2 Checking status:1 Finished successfully:6
Progress: Active:2 Finished successfully:7
Progress: Active:1 Checking status:1 Finished successfully:7
Progress: Submitted:1 Finished successfully:9
Progress: Submitted:1 Finished successfully:9
Progress: Active:1 Finished successfully:9
Final status: Finished successfully:10
[wilde at ibicluster marc]$

----- Original Message -----
> Hi Mike,
>
> > Marc, can you point me to the directory in which you performed this run, and make sure that I can access it?
>
> there's an awful lot of files here... (see below)
>
> > I am wondering if you changed swift.properties, either by specifying a -config file on the swift command line or by editing swift.properties in your swift build or $HOME/.swift directory?
>
> I use a "cf" file, and in it I have:
>
> wrapperlog.always.transfer=true
> sitedir.keep=true
> execution.retries=0
> lazy.errors=false
> status.mode=provider
> use.provider.staging=false
> provider.staging.pin.swiftfiles=false
>
> -> I will change the status.mode to "file" tomorrow, and I will let you know if it works! If not, then I'll open up my folders and let you in :-D
>
> > As I recall you are using the SGE provider in non-coaster mode, right?
>
> That's it:
>
> ...
>
> PS On Swift's website, perhaps you could put the parameter sets that "work" or that are "specific" for each site/cluster...? I hope I'm the first one using swift on IBI ;-)
>
> As for my code in SVN, I have no problem with that, but my main program has to be compiled 32-bit (I compiled it on godzilla). I tried compiling it on a 64-bit machine, but the program crashes when executing (most likely because of the FFT lib it uses). I would prefer transferring the executable instead of an in-place compilation.
>
> I'll keep you informed, Marc.

-- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory

From aespinosa at cs.uchicago.edu Thu Nov 18 18:54:29 2010
From: aespinosa at cs.uchicago.edu (Allan Espinosa)
Date: Thu, 18 Nov 2010 18:54:29 -0600
Subject: [Swift-user] throttle transfers and vdl:stagein graphs
In-Reply-To: <1289277290.18134.12.camel@blabla2.none>
References: <1289277290.18134.12.camel@blabla2.none>
Message-ID:

Ah, I see that the log entry nearest the actual task:transfer() call is in vdl:dostageinfile (no parallelFor loop there). But I still see more concurrent transfers than the throttle in swift.properties allows.
One of the classes invoked by task:transfer() is /org/globus/cog/karajan/workflow/nodes/grid/GridTransfer.class ? I'll try adding this to my log4j.properties and see what will happen.

-Allan

2010/11/8 Mihael Hategan :
> On Mon, 2010-11-08 at 20:50 -0600, Allan Espinosa wrote:
>> Hi,
>>
>> In my workflow, I use the default throttle.transfers=4. But my dostagein-total plot indicates that there are 72 stagein events going on for around 90 seconds. Shouldn't there be a linear ramp-up, or a saw-tooth pattern at the plateau, because of having throttled transfers?
>
> Lies. And statistics.
>
> The plot indicates that a number of instances of a certain portion of vdl-int is executing.
>
> If you look at that portion of vdl-int (i.e. between setprogress("Stage in") and setprogress("Submitting")) there are a few things happening, including directory creation.
>
> Essentially you are dealing with the following pattern:
>
> parallelFor(...
>   a()
>   throttle(4, b())
>   c()
> )
>
> The graph would show something like the parallelism in the invocation of the body of parallelFor. And it is quite possible that all a() invocations start well before any of the b() invocations start. The only accurate way to see the effect of the throttle is to trace the b() invocations, which you can probably do by looking at the status of file transfer tasks (by enabling the relevant logging stuff).
>
> Mihael
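For concreteness, the log4j.properties addition contemplated above would look something like this; the GridTransfer logger name comes from the class path quoted above, while the abstractions-module package name is a guess to be checked against the source:

# trace the karajan transfer node named above
log4j.logger.org.globus.cog.karajan.workflow.nodes.grid.GridTransfer=DEBUG
# guessed package for transfer-task status changes in the abstractions module
log4j.logger.org.globus.cog.abstraction.impl.common.task=DEBUG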
From hategan at mcs.anl.gov Thu Nov 18 22:59:51 2010
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Thu, 18 Nov 2010 20:59:51 -0800
Subject: [Swift-user] throttle transfers and vdl:stagein graphs
In-Reply-To:
References: <1289277290.18134.12.camel@blabla2.none>
Message-ID: <1290142791.32540.0.camel@blabla2.none>

On Thu, 2010-11-18 at 18:54 -0600, Allan Espinosa wrote:
> Ah, I see that the log entry nearest the actual task:transfer() call is in vdl:dostageinfile (no parallelFor loop there). But I still see more concurrent transfers than the throttle in swift.properties allows.
>
> One of the classes invoked by task:transfer() is /org/globus/cog/karajan/workflow/nodes/grid/GridTransfer.class ?
>
> I'll try adding this to my log4j.properties and see what will happen.

The transfer task status is the best choice. That's in the abstractions module.

Mihael