From benc at hawaga.org.uk Thu Nov 1 05:45:16 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 1 Nov 2007 10:45:16 +0000 (GMT) Subject: [Swift-devel] script mapper In-Reply-To: <1193878744.18796.29.camel@blabla.mcs.anl.gov> References: <1193869309.10145.9.camel@blabla.mcs.anl.gov> <1193876387.18296.5.camel@blabla.mcs.anl.gov> <4729216A.5060805@mcs.anl.gov> <1193878744.18796.29.camel@blabla.mcs.anl.gov> Message-ID:

On Wed, 31 Oct 2007, Mihael Hategan wrote: > > Is this committed/committable? > > Committed. > > So the mapper is called "ext", takes a script via exec=, and then > > arbitrary mapper-specific args?

ext is inconsistent with the convention used for other mapper names. --

From hategan at mcs.anl.gov Thu Nov 1 09:25:04 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Nov 2007 09:25:04 -0500 Subject: [Swift-devel] script mapper In-Reply-To: References: <1193869309.10145.9.camel@blabla.mcs.anl.gov> <1193876387.18296.5.camel@blabla.mcs.anl.gov> <4729216A.5060805@mcs.anl.gov> <1193878744.18796.29.camel@blabla.mcs.anl.gov> Message-ID: <1193927104.30473.1.camel@blabla.mcs.anl.gov>

On Thu, 2007-11-01 at 10:45 +0000, Ben Clifford wrote: > > On Wed, 31 Oct 2007, Mihael Hategan wrote: > > > > Is this committed/committable? > > > > Committed. > > > > So the mapper is called "ext", takes a script via exec=, and then > > > arbitrary mapper-specific args? > > ext is inconsistent with the convention used for other mapper names.

Yes. I think it's silly to have to add _mapper to all the mapper names. Or not? >

From benc at hawaga.org.uk Thu Nov 1 09:26:57 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 1 Nov 2007 14:26:57 +0000 (GMT) Subject: [Swift-devel] script mapper In-Reply-To: <1193927104.30473.1.camel@blabla.mcs.anl.gov> References: <1193869309.10145.9.camel@blabla.mcs.anl.gov> <1193876387.18296.5.camel@blabla.mcs.anl.gov> <4729216A.5060805@mcs.anl.gov> <1193878744.18796.29.camel@blabla.mcs.anl.gov> <1193927104.30473.1.camel@blabla.mcs.anl.gov> Message-ID:

On Thu, 1 Nov 2007, Mihael Hategan wrote: > Yes. I think it's silly to have to add _mapper to all the mapper names. > Or not?

It is silly. It is also silly to have multiple conventions. Perhaps at Christmastime, they can all be renamed again. --

From hategan at mcs.anl.gov Thu Nov 1 09:33:12 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Nov 2007 09:33:12 -0500 Subject: [Swift-devel] script mapper In-Reply-To: References: <1193869309.10145.9.camel@blabla.mcs.anl.gov> <1193876387.18296.5.camel@blabla.mcs.anl.gov> <4729216A.5060805@mcs.anl.gov> <1193878744.18796.29.camel@blabla.mcs.anl.gov> <1193927104.30473.1.camel@blabla.mcs.anl.gov> Message-ID: <1193927592.30473.8.camel@blabla.mcs.anl.gov>

On Thu, 2007-11-01 at 14:26 +0000, Ben Clifford wrote: > > On Thu, 1 Nov 2007, Mihael Hategan wrote: > > > Yes. I think it's silly to have to add _mapper to all the mapper names. > > Or not? > > It is silly. > > It is also silly to have multiple conventions. Right. > > Perhaps at Christmastime, they can all be renamed again. Well, it's a choice. >

From benc at hawaga.org.uk Thu Nov 1 11:57:10 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 1 Nov 2007 16:57:10 +0000 (GMT) Subject: [Swift-devel] ConcurrentMapper changes Message-ID:

I just modified the way that ConcurrentMapper lays out files (r1437). You will likely not have encountered ConcurrentMapper by name. It is used when you do not specify a mapper for a dataset, for example for intermediate variables.
Previously, all files named by this mapper were given a long name in the root directory of the submit and cache directories. When a large number of files were named in this fashion, for example in an array with thousands of elements, this would result in a file for each element and a root directory with thousands of files. Most immediately, I encountered this problem working with Andrew Jamieson running on TeraPort using GPFS. Many hosts attempting to access one directory is severely unscalable on GPFS.

The changes I have made add more structure to filenames generated by the ConcurrentMapper:

1. All files appear in a _concurrent/ subdirectory.

2. Simple/marker-typed files appear directly below _concurrent, named as before. For example:

file outfile;

might give a filename:

_concurrent//outfile-3339612a-08e1-443d-bd14-2329080d2d94-

3. Structures are mapped to a sub-directory, with each element being a file in that subdirectory. For example,

type footype { file left; file right; }
footype structurefile;

might give a directory:

_concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field

containing two files:

_concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/left
_concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/right

4. Array elements are placed in a subdirectory. Within that subdirectory, the index is used to construct a further hierarchy such that there will never be more than 50 directories/files in any one directory. For example:

file manyfile[];

might give mappings like this:

manyfile[0] stored in: _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-0
manyfile[22] stored in: _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-22
manyfile[30] stored in: _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/h5/elt-30
manyfile[734] stored in: _concurrent//manyfile-bcdeedee-4df7-4d21-a207-d8051da3d133--array/h9/h4/elt-734

To form the paths, basically something like this happens: convert each index into base 25; discard the most significant digit; then, starting at the least significant digit and working towards the most significant digit, turn each digit into a subdirectory. For example, 734 in base 10 is (1) (4) (9) in base 25, so we form the intermediate path /h9/h4/.

Doing this means that for large arrays directory paths will grow, whilst for small arrays they will stay short; and the size of the array does not need to be known ahead of time. The constant '25' can easily be adjusted. It's a compiled-in constant defined in one place at the moment, but could be made into a mapper parameter. --

From hategan at mcs.anl.gov Thu Nov 1 12:10:45 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Nov 2007 12:10:45 -0500 Subject: [Swift-devel] a vdc like thing Message-ID: <1193937046.4196.1.camel@blabla.mcs.anl.gov>

http://labs.google.com/papers/chubby.html

I think it at least hints at the dimension of the VDC problem, though I think things can be simplified by assuming only one local VDC. Mihael

From benc at hawaga.org.uk Thu Nov 1 18:23:34 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 1 Nov 2007 23:23:34 +0000 (GMT) Subject: [Swift-devel] karajan scheduler hack Message-ID:

Recently, I've been making runs with Andrew with a scheduler hack to stop karajan's site score going below -10. This has been useful in stopping clustered job failures from causing catastrophic slowdown. It's not clear what, if any, easy change can be made that isn't so hackish to achieve this same benefit. --
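As an aside on the ConcurrentMapper message above: the base-25 path construction it describes can be sketched in a few lines of Java. This is only an illustration of the scheme as described, not the actual ConcurrentMapper source; the class and method names are invented and the path prefix is abbreviated.

    import java.util.ArrayList;
    import java.util.List;

    /** Illustrative sketch of the base-25 directory hashing described above;
        not the actual ConcurrentMapper source. */
    public class ConcurrentPathSketch {

        static final int RADIX = 25; // the compiled-in constant mentioned above

        /** Builds the intermediate "hN/" portion of the path for an array index. */
        static String intermediatePath(int index) {
            // Collect base-RADIX digits, least significant first.
            List<Integer> digits = new ArrayList<Integer>();
            int n = index;
            do {
                digits.add(n % RADIX);
                n /= RADIX;
            } while (n > 0);
            // Discard the most significant digit (the last one collected).
            digits.remove(digits.size() - 1);
            // The least significant remaining digit becomes the outermost directory.
            StringBuilder path = new StringBuilder();
            for (int d : digits) {
                path.append('h').append(d).append('/');
            }
            return path.toString(); // "" for indices below 25, "h9/h4/" for 734
        }

        public static void main(String[] args) {
            String prefix = "_concurrent/manyfile-<uuid>--array/"; // illustrative prefix
            for (int i : new int[] { 0, 22, 30, 734 }) {
                System.out.println(i + " -> " + prefix + intermediatePath(i) + "elt-" + i);
            }
            // Prints .../elt-0, .../elt-22, .../h5/elt-30 and .../h9/h4/elt-734,
            // matching the example mappings above.
        }
    }

Discarding the most significant digit is what keeps small arrays flat: any index below 25 maps straight to elt-N with no intermediate directories, while larger indices gain one directory level per extra base-25 digit.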
From hategan at mcs.anl.gov Thu Nov 1 19:06:43 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Nov 2007 19:06:43 -0500 Subject: [Swift-devel] karajan scheduler hack In-Reply-To: References: Message-ID: <1193962004.10923.11.camel@blabla.mcs.anl.gov>

On Thu, 2007-11-01 at 23:23 +0000, Ben Clifford wrote: > Recently, I've been making runs with Andrew with a scheduler hack to stop > karajan's site score going below -10. > > This has been useful in stopping clustered job failures from causing > catastrophic slowdown. > > It's not clear what, if any, easy change can be made that isn't so hackish to > achieve this same benefit.

I think that would be a reasonable hack for now. However, I do think that the problem is the details of the algorithm not being well thought out. In principle, I think it should remain a feedback system (also given that the opportunistic scheduling for VDS paper pretty much did the same). Given that, and given that with the current assumptions all the feedback inputs are accounted for, I'm led to believe that this is a matter of properly specifying the feedback function. But I'm fuzzy on many things. >

From wilde at mcs.anl.gov Sat Nov 3 18:19:00 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 03 Nov 2007 18:19:00 -0500 Subject: [Swift-devel] Kickstart on Angle vs not? Message-ID: <472D01E4.5010204@mcs.anl.gov>

Ben, what's the current tradeoff on running kickstart for the angle work? When I last checked with you kickstart still goes to one dir and will likely cause contention. I now realize that some of the same data can now be obtained from the wrapper logs. Better to avoid kickstart then, or do you intend to work on it this week? (making no value judgment here - just want your suggestion on most viable route for Angle...)

From itf at mcs.anl.gov Sat Nov 3 18:26:27 2007 From: itf at mcs.anl.gov (Ian Foster) Date: Sat, 3 Nov 2007 23:26:27 +0000 Subject: [Swift-devel] Kickstart on Angle vs not? Message-ID: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry>

We use a different mechanism to retrieve kickstart output vs. log file output, it seems. I'd be interested to understand them.

------Original Message------ From: Mike Wilde Sender: swift-devel-bounces at ci.uchicago.edu To: swift-devel To: Benjamin Clifford Sent: Nov 3, 2007 6:19 PM Subject: [Swift-devel] Kickstart on Angle vs not?

Ben, what's the current tradeoff on running kickstart for the angle work? When I last checked with you kickstart still goes to one dir and will likely cause contention. I now realize that some of the same data can now be obtained from the wrapper logs. Better to avoid kickstart then, or do you intend to work on it this week? (making no value judgment here - just want your suggestion on most viable route for Angle...)

_______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

Sent via BlackBerry from T-Mobile

From benc at hawaga.org.uk Sat Nov 3 18:53:32 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 3 Nov 2007 23:53:32 +0000 (GMT) Subject: [Swift-devel] Re: Kickstart on Angle vs not? In-Reply-To: <472D01E4.5010204@mcs.anl.gov> References: <472D01E4.5010204@mcs.anl.gov> Message-ID:

On Sat, 3 Nov 2007, Michael Wilde wrote: > When I last checked with you kickstart still goes to one dir and will likely > cause contention.
It doesn't any more - I changed it in the commits yesterday / thursday at the same time I did the other ones. Its not been heavily tested, though. > (making no value judgment here - just want your suggestion on most viable > route for Angle...) kickstart and the info logs provide somewhat different info. For measuring conflict on the shared filesystems, the info logs are probably more useful. -- From benc at hawaga.org.uk Sat Nov 3 18:59:04 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 3 Nov 2007 23:59:04 +0000 (GMT) Subject: [Swift-devel] Kickstart on Angle vs not? In-Reply-To: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry> References: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry> Message-ID: On Sat, 3 Nov 2007, Ian Foster wrote: > we use a different mechanbism to retrieve kickstart ouput vs. Log file > output, it seems. I'd be interested to understand them. kickstart records get sent back to the submit host automatically (subject to various configuration options). wrapper logs never get staged anywhere automatically. -- From wilde at mcs.anl.gov Sat Nov 3 23:34:02 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 03 Nov 2007 23:34:02 -0500 Subject: [Swift-devel] Error in syncing job start with input file availability? Message-ID: <472D4BBA.6080404@mcs.anl.gov> In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like theres a chance that a job attempted to start before its data was visible to the node (not sure, just suspicious). Its a 5-job angle run. 4 jobs worked. The 5th job failed, the one for index 1 of a 5-element input file array. The wf ran with ~wilde/swiftdata/* as the storage and work dir (2 subdirs below that). So it was on NFS. All 5 input files are in the shared/ dir, but the failing job is the one whose timestamp is last. (0, 2,3,4 worked; 1 failed) I also got 3 emails from PBS of the form: PBS Job Id: 1571647.tg-master.uc.teragrid.org Job Name: STDIN Aborted by PBS Server Job cannot be executed See Administrator for help all dated 8:05 PM, three consecutive job ids, *47, 48, 49. Q: Do these email messages indicate that the job was failed by PBS before the app was started, or do these messages indicate a non-zero app exit, eg, if its input file was missing? 
The input files on shared/ were dated: drwxr-xr-x 4 wilde allocate 512 2007-11-03 20:04:33.000000000 -0500 _concurrent/ -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 -0500 pc1.pcap -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:55.000000000 -0500 pc2.pcap -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 -0500 pc3.pcap -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:47.000000000 -0500 pc4.pcap -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:51.000000000 -0500 pc5.pcap -rw-r--r-- 1 wilde allocate 813 2007-11-03 20:04:33.000000000 -0500 seq.sh -rw-r--r-- 1 wilde allocate 4848 2007-11-03 20:04:33.000000000 -0500 wrapper.sh The awf3*.log file shows: 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\ ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ provider=file 2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END jobid=angle4-ujal0lji - Staging in finished 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\ 6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] tmpdir=awf3\ -20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC (Note that the logfile for some reason logs times 1 hour behind???) But the main suspicious thing above is that while the log shows stagin complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod date to be 4:55 past the hour, while the job was started (queued?) at 4:52. If the job happened to hit the PBS queue right at the time PBS was doing a queue poll, it may have started right away, and somehow started before file pc1.pcap was visible to the worker node. Im not sure what if anything in the synchronization prevents this, especially if NFS close-to-open consistency is broken. (Which we are very suspicious of on this site and with Linux NFS in general). Lastly, i've run the identical workflow twice more now, and its worked with no change both times. Any ideas or other explanations for what may have happened here? Also, ideas why the swift log file shows times an hour behind? From wilde at mcs.anl.gov Sat Nov 3 23:44:57 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 03 Nov 2007 23:44:57 -0500 Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472D4BBA.6080404@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> Message-ID: <472D4E49.7080500@mcs.anl.gov> forgot to state: the two identical runs after this that worked are in same log dir, run122 and run123 - mike On 11/3/07 11:34 PM, Michael Wilde wrote: > In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like > theres a chance that a job attempted to start before its data was > visible to the node (not sure, just suspicious). > > Its a 5-job angle run. 4 jobs worked. The 5th job failed, the one for > index 1 of a 5-element input file array. > > The wf ran with ~wilde/swiftdata/* as the storage and work dir (2 > subdirs below that). So it was on NFS. > > All 5 input files are in the shared/ dir, but the failing job is the one > whose timestamp is last. (0, 2,3,4 worked; 1 failed) > > I also got 3 emails from PBS of the form: > > PBS Job Id: 1571647.tg-master.uc.teragrid.org > Job Name: STDIN > Aborted by PBS Server > Job cannot be executed > See Administrator for help > > all dated 8:05 PM, three consecutive job ids, *47, 48, 49. 
> > Q: Do these email messages indicate that the job was failed by PBS > before the app was started, or do these messages indicate a non-zero app > exit, eg, if its input file was missing? > > The input files on shared/ were dated: > > drwxr-xr-x 4 wilde allocate 512 2007-11-03 20:04:33.000000000 > -0500 _concurrent/ > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 > -0500 pc1.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:55.000000000 > -0500 pc2.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 > -0500 pc3.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:47.000000000 > -0500 pc4.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:51.000000000 > -0500 pc5.pcap > -rw-r--r-- 1 wilde allocate 813 2007-11-03 20:04:33.000000000 > -0500 seq.sh > -rw-r--r-- 1 wilde allocate 4848 2007-11-03 20:04:33.000000000 > -0500 wrapper.sh > > The awf3*.log file shows: > > 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END > file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\ > ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ > provider=file > 2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END > jobid=angle4-ujal0lji - Staging in finished > 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START > jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\ > 6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, > _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] > tmpdir=awf3\ > -20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC > > (Note that the logfile for some reason logs times 1 hour behind???) > > But the main suspicious thing above is that while the log shows stagin > complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod > date to be 4:55 past the hour, while the job was started (queued?) at 4:52. > > If the job happened to hit the PBS queue right at the time PBS was doing > a queue poll, it may have started right away, and somehow started before > file pc1.pcap was visible to the worker node. Im not sure what if > anything in the synchronization prevents this, especially if NFS > close-to-open consistency is broken. (Which we are very suspicious of on > this site and with Linux NFS in general). > > Lastly, i've run the identical workflow twice more now, and its worked > with no change both times. > > Any ideas or other explanations for what may have happened here? > > Also, ideas why the swift log file shows times an hour behind? > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Sat Nov 3 23:57:39 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 4 Nov 2007 04:57:39 +0000 (GMT) Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472D4BBA.6080404@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> Message-ID: On Sat, 3 Nov 2007, Michael Wilde wrote: > Also, ideas why the swift log file shows times an hour behind? I think they aren't wrong. The UTC offset is listed as -6. If you're expecting the times to be Chicago local times then they would be an hour different and there would be a -5 UTC offset. Most likely this is caused by an outdated Java that isn't aware of the US federal Energy Policy Act of 2005 (I've encountered at least one such this week) and believes that US daylight savings time ended last week. 
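One quick way to see which daylight-saving rules a particular JVM is applying — an illustrative sketch only, not something used in this thread — is to ask its own TimeZone data for the offset on the date in question:

    import java.util.GregorianCalendar;
    import java.util.TimeZone;

    /** Illustrative check of the JVM's timezone rules; not part of Swift. */
    public class TzCheck {
        public static void main(String[] args) {
            TimeZone chicago = TimeZone.getTimeZone("America/Chicago");
            // Noon local time on 2007-11-03 (Calendar months are zero-based: 10 = November).
            GregorianCalendar cal = new GregorianCalendar(chicago);
            cal.set(2007, 10, 3, 12, 0, 0);
            int offsetHours = chicago.getOffset(cal.getTimeInMillis()) / (60 * 60 * 1000);
            // Post-2005-Act rules: daylight time still in effect on Nov 3, so UTC-5.
            // Outdated timezone data: standard time already, so UTC-6.
            System.out.println("Offset for America/Chicago on 2007-11-03: UTC" + offsetHours);
        }
    }

An up-to-date JRE prints UTC-5 for 2007-11-03; one with pre-2005-Act timezone data prints UTC-6, which matches the offset seen in the log.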
However, as of tomorrow, all will be rectified as Chicago really will be using UTC-6. -- From hategan at mcs.anl.gov Sat Nov 3 23:59:32 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 03 Nov 2007 23:59:32 -0500 Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472D4BBA.6080404@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> Message-ID: <1194152372.10816.2.camel@blabla.mcs.anl.gov> Looks to me like a problem with PBS rather than something with the jobs. So I don't think this is worth investigating. It belongs to the "random bad things happen on occasion" class of problems, for which we have restarts and scoring. Mihael On Sat, 2007-11-03 at 23:34 -0500, Michael Wilde wrote: > In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like > theres a chance that a job attempted to start before its data was > visible to the node (not sure, just suspicious). > > Its a 5-job angle run. 4 jobs worked. The 5th job failed, the one for > index 1 of a 5-element input file array. > > The wf ran with ~wilde/swiftdata/* as the storage and work dir (2 > subdirs below that). So it was on NFS. > > All 5 input files are in the shared/ dir, but the failing job is the one > whose timestamp is last. (0, 2,3,4 worked; 1 failed) > > I also got 3 emails from PBS of the form: > > PBS Job Id: 1571647.tg-master.uc.teragrid.org > Job Name: STDIN > Aborted by PBS Server > Job cannot be executed > See Administrator for help > > all dated 8:05 PM, three consecutive job ids, *47, 48, 49. > > Q: Do these email messages indicate that the job was failed by PBS > before the app was started, or do these messages indicate a non-zero app > exit, eg, if its input file was missing? > > The input files on shared/ were dated: > > drwxr-xr-x 4 wilde allocate 512 2007-11-03 20:04:33.000000000 > -0500 _concurrent/ > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 > -0500 pc1.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:55.000000000 > -0500 pc2.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 > -0500 pc3.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:47.000000000 > -0500 pc4.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:51.000000000 > -0500 pc5.pcap > -rw-r--r-- 1 wilde allocate 813 2007-11-03 20:04:33.000000000 > -0500 seq.sh > -rw-r--r-- 1 wilde allocate 4848 2007-11-03 20:04:33.000000000 > -0500 wrapper.sh > > The awf3*.log file shows: > > 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END > file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\ > ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ > provider=file > 2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END > jobid=angle4-ujal0lji - Staging in finished > 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START > jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\ > 6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, > _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] > tmpdir=awf3\ > -20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC > > (Note that the logfile for some reason logs times 1 hour behind???) > > But the main suspicious thing above is that while the log shows stagin > complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod > date to be 4:55 past the hour, while the job was started (queued?) at 4:52. 
> > If the job happened to hit the PBS queue right at the time PBS was doing > a queue poll, it may have started right away, and somehow started before > file pc1.pcap was visible to the worker node. Im not sure what if > anything in the synchronization prevents this, especially if NFS > close-to-open consistency is broken. (Which we are very suspicious of on > this site and with Linux NFS in general). > > Lastly, i've run the identical workflow twice more now, and its worked > with no change both times. > > Any ideas or other explanations for what may have happened here? > > Also, ideas why the swift log file shows times an hour behind? > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Sun Nov 4 00:18:29 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 4 Nov 2007 05:18:29 +0000 (GMT) Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472D4BBA.6080404@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> Message-ID: On Sat, 3 Nov 2007, Michael Wilde wrote: > Q: Do these email messages indicate that the job was failed by PBS before the > app was started, or do these messages indicate a non-zero app exit, eg, if its > input file was missing? I don't know what the different PBS errors mean. That job never finished as far as swift is concerned (or at least swift exited before logging anything) - perhaps you're running with lazy errors turned off (which is the default at the moment; I am undecided whether off or on is the best default). > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 -0500 > pc1.pcap > 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START > jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\ > But the main suspicious thing above is that while the log shows stagin > complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod date to > be 4:55 past the hour, while the job was started (queued?) at 4:52. mod date is 4:52. > If the job happened to hit the PBS queue right at the time PBS was doing a > queue poll, it may have started right away, and somehow started before file > pc1.pcap was visible to the worker node. Im not sure what if anything in the > synchronization prevents this, especially if NFS close-to-open consistency is > broken. (Which we are very suspicious of on this site and with Linux NFS in > general). What site? Can you use a different FS? -- From wilde at mcs.anl.gov Sun Nov 4 07:45:41 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 04 Nov 2007 07:45:41 -0600 Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <1194152372.10816.2.camel@blabla.mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> <1194152372.10816.2.camel@blabla.mcs.anl.gov> Message-ID: <472DCD05.4070609@mcs.anl.gov> On 11/4/07 12:18 AM, Ben Clifford wrote: >> But the main suspicious thing above is that while the log shows stagin >> complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod date to >> be 4:55 past the hour, while the job was started (queued?) at 4:52. > > mod date is 4:52. I got my file and job mixed up. The file was pc2.pcap, the mod date was 4:55, but so was the job start time, so that look ok. My mistake. > > What site? Can you use a different FS? > uc-teragrid. I will experiment with both nfs and gpfs. 
Did you determine with Andrew which is faster? More reliable? On 11/3/07 11:59 PM, Mihael Hategan wrote: > Looks to me like a problem with PBS rather than something with the jobs. > So I don't think this is worth investigating. It belongs to the "random > bad things happen on occasion" class of problems, for which we have > restarts and scoring. Possibly. In this case the job was re-run twice (3 total, within a minute) and all three failed, all got the same PBS error message emailed to me. I agree, not worth investigating unless it happens more. grep JOB_START a*.log | grep pc2 2007-11-03 19:04:55,814-0600 DEBUG vdl:execute2 JOB_START jobid=angle4-tjal0lji tr=angle4 arguments=[pc2.pcap, _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] tmpdir=awf3-20071103-1904-2z266pk3/jobs/t/angle4-tjal0lji host=UC 2007-11-03 19:05:29,495-0600 DEBUG vdl:execute2 JOB_START jobid=angle4-vjal0lji tr=angle4 arguments=[pc2.pcap, _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] tmpdir=awf3-20071103-1904-2z266pk3/jobs/v/angle4-vjal0lji host=UC 2007-11-03 19:05:42,678-0600 DEBUG vdl:execute2 JOB_START jobid=angle4-xjal0lji tr=angle4 arguments=[pc2.pcap, _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] tmpdir=awf3-20071103-1904-2z266pk3/jobs/x/angle4-xjal0lji host=UC vz$ grep EXCEPT a*.log 2007-11-03 19:05:28,567-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-tjal0lji - Application exception: No status file was found. Check the shared filesystem on UC 2007-11-03 19:05:41,754-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-vjal0lji - Application exception: No status file was found. Check the shared filesystem on UC 2007-11-03 19:05:55,048-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-xjal0lji - Application exception: No status file was found. Check the shared filesystem on UC - Mike > > Mihael > > On Sat, 2007-11-03 at 23:34 -0500, Michael Wilde wrote: >> In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like >> theres a chance that a job attempted to start before its data was >> visible to the node (not sure, just suspicious). >> >> Its a 5-job angle run. 4 jobs worked. The 5th job failed, the one for >> index 1 of a 5-element input file array. >> >> The wf ran with ~wilde/swiftdata/* as the storage and work dir (2 >> subdirs below that). So it was on NFS. >> >> All 5 input files are in the shared/ dir, but the failing job is the one >> whose timestamp is last. (0, 2,3,4 worked; 1 failed) >> >> I also got 3 emails from PBS of the form: >> >> PBS Job Id: 1571647.tg-master.uc.teragrid.org >> Job Name: STDIN >> Aborted by PBS Server >> Job cannot be executed >> See Administrator for help >> >> all dated 8:05 PM, three consecutive job ids, *47, 48, 49. >> >> Q: Do these email messages indicate that the job was failed by PBS >> before the app was started, or do these messages indicate a non-zero app >> exit, eg, if its input file was missing? 
>> >> The input files on shared/ were dated: >> >> drwxr-xr-x 4 wilde allocate 512 2007-11-03 20:04:33.000000000 >> -0500 _concurrent/ >> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 >> -0500 pc1.pcap >> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:55.000000000 >> -0500 pc2.pcap >> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 >> -0500 pc3.pcap >> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:47.000000000 >> -0500 pc4.pcap >> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:51.000000000 >> -0500 pc5.pcap >> -rw-r--r-- 1 wilde allocate 813 2007-11-03 20:04:33.000000000 >> -0500 seq.sh >> -rw-r--r-- 1 wilde allocate 4848 2007-11-03 20:04:33.000000000 >> -0500 wrapper.sh >> >> The awf3*.log file shows: >> >> 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END >> file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\ >> ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ >> provider=file >> 2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END >> jobid=angle4-ujal0lji - Staging in finished >> 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START >> jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\ >> 6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, >> _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] >> tmpdir=awf3\ >> -20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC >> >> (Note that the logfile for some reason logs times 1 hour behind???) >> >> But the main suspicious thing above is that while the log shows stagin >> complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod >> date to be 4:55 past the hour, while the job was started (queued?) at 4:52. >> >> If the job happened to hit the PBS queue right at the time PBS was doing >> a queue poll, it may have started right away, and somehow started before >> file pc1.pcap was visible to the worker node. Im not sure what if >> anything in the synchronization prevents this, especially if NFS >> close-to-open consistency is broken. (Which we are very suspicious of on >> this site and with Linux NFS in general). >> >> Lastly, i've run the identical workflow twice more now, and its worked >> with no change both times. >> >> Any ideas or other explanations for what may have happened here? >> >> Also, ideas why the swift log file shows times an hour behind? >> >> >> >> >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > From benc at hawaga.org.uk Sun Nov 4 07:51:53 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 4 Nov 2007 13:51:53 +0000 (GMT) Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472DCD05.4070609@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> <1194152372.10816.2.camel@blabla.mcs.anl.gov> <472DCD05.4070609@mcs.anl.gov> Message-ID: On Sun, 4 Nov 2007, Michael Wilde wrote: > uc-teragrid. I will experiment with both nfs and gpfs. > Did you determine with Andrew which is faster? More reliable? We didn't try NFS so I don't really have any objective data about the two. Though it strikes me i shoul dcheck that the logging is tracking where on the remote site the run directory is placed so that we can tell later on. 
-- From wilde at mcs.anl.gov Sun Nov 4 11:03:53 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 04 Nov 2007 11:03:53 -0600 Subject: [Swift-devel] GT2 service down on uc-teragrid Message-ID: <472DFB79.2000708@mcs.anl.gov> Ti, TG-Help, I'm unable to submit globus jobs to tg-grid via GRAM2: vz$ globus-job-run tg-grid.uc.teragrid.org /bin/hostname GRAM Job submission failed because the connection to the server failed (check host and port) (error code 12) vz$ I can ping the host and telnet can connect to port 2119. It was running OK last night around 10PM. We're using this machine for SC07 tutorials, demos and challenge competitions, so anything you could do to resolve quickly would be appreciated. (Any chance the DST change affected it?) Thanks, Mike -- Michael Wilde Computation Institute University of Chicago and Argonne National Laboratory 5640 S. Ellis Av, Suite 405 Chicago, IL 60637 USA 708-203-9548 From wilde at mcs.anl.gov Sun Nov 4 19:37:37 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 04 Nov 2007 19:37:37 -0600 Subject: [Swift-devel] Kickstart runs on localhost are failing Message-ID: <472E73E1.9030607@mcs.anl.gov> I get job exceptions when I run with kickstart on localhost, regardless of whether clustered or not. The jobs seem to run (3x each) but fail each time. First time gets "Application exception: Missing argument jobdir", 2nd & 3rd get "Application exception: The cache already contains localhost:awf4-20071104-1843-ds8hn11a..." Clustered run is in run137, unclustered in run138 The latter log dir has a file swiftdata.find.out which lists all the files in my data dir (has a local/ branch at the top for localhost jobs). Error in both cases is below. Will try next doing kickstart in both ways via gram. - Mike 2007-11-04 18:47:40,946-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-cgqcmmji - Application exception: Missing argument jobdir for sys:element(rhost, wfdir, jobid, jobdir) 2007-11-04 18:47:41,085-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2-1194223436415) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-cgqcmmji-stderr.txt not found. 2007-11-04 18:47:41,344-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2-1194223436424) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-cgqcmmji-stdout.txt not found. 2007-11-04 18:47:41,503-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-bgqcmmji - Application exception: Missing argument jobdir for sys:element(rhost, wfdir, jobid, jobdir) 2007-11-04 18:47:41,553-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1194223436458) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-bgqcmmji-stderr.txt not found. 2007-11-04 18:47:41,638-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1194223436467) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-bgqcmmji-stdout.txt not found. 2007-11-04 18:47:41,882-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-agqcmmji - Application exception: Missing argument jobdir for sys:element(rhost, wfdir, jobid, jobdir) 2007-11-04 18:47:41,987-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-3-1194223436500) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-agqcmmji-stderr.txt not found. 
2007-11-04 18:47:42,047-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-3-1194223436507) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-agqcmmji-stdout.txt not found. 2007-11-04 18:51:18,439-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-dgqcmmji - Application exception: The cache already contains localhost:awf4-20071104-1843-ds8hn11a/shared/cf0000.angle. 2007-11-04 18:51:18,628-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2-1194223436543) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-dgqcmmji-stderr.txt not found. 2007-11-04 18:51:18,762-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2-1194223436550) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-dgqcmmji-stdout.txt not found. 2007-11-04 18:51:25,976-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-egqcmmji - Application exception: The cache already contains localhost:awf4-20071104-1843-ds8hn11a/shared/of0002.angle. 2007-11-04 18:51:26,401-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1194223436585) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-egqcmmji-stderr.txt not found. 2007-11-04 18:51:26,726-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1194223436592) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-egqcmmji-stdout.txt not found. 2007-11-04 18:51:28,040-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-fgqcmmji - Application exception: The cache already contains localhost:awf4-20071104-1843-ds8hn11a/shared/cf0001.angle. 2007-11-04 18:51:28,492-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-3-1194223436627) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-fgqcmmji-stderr.txt not found. 2007-11-04 18:51:28,816-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-3-1194223436634) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-fgqcmmji-stdout.txt not found. 2007-11-04 18:54:44,088-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-hgqcmmji - Application exception: The cache already contains localhost:awf4-20071104-1843-ds8hn11a/shared/of0002.angle. 2007-11-04 18:54:44,440-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1194223436670) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-hgqcmmji-stderr.txt not found. 2007-11-04 18:54:44,652-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1194223436677) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-hgqcmmji-stdout.txt not found. 
2007-11-04 18:54:44,741-0600 DEBUG VDL2ExecutionContext Exception in angle4: Exception in angle4: sys:exception @ vdl-int.k, line: 423 at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) 2007-11-04 18:54:46,190-0600 INFO ExecutionContext Detailed exception: Exception in angle4: sys:exception @ vdl-int.k, line: 423 at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) From benc at hawaga.org.uk Sun Nov 4 19:40:04 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 01:40:04 +0000 (GMT) Subject: [Swift-devel] Kickstart runs on localhost are failing In-Reply-To: <472E73E1.9030607@mcs.anl.gov> References: <472E73E1.9030607@mcs.anl.gov> Message-ID: try r1456 - that has a kickstart record transfer fix. > The jobs seem to run (3x each) but fail each time. First time gets > "Application exception: Missing argument jobdir" r1456 fixes this. > , 2nd & 3rd get "Application > exception: The cache already contains > localhost:awf4-20071104-1843-ds8hn11a..." however, that suggests that there's a cache management problem now that I will investigate. -- From wilde at mcs.anl.gov Sun Nov 4 20:38:22 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 04 Nov 2007 20:38:22 -0600 Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472DCD05.4070609@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> <1194152372.10816.2.camel@blabla.mcs.anl.gov> <472DCD05.4070609@mcs.anl.gov> Message-ID: <472E821E.50109@mcs.anl.gov> A very similar error occured a bit ago, tonight. Its in ~benc/swiftlogs/wilde/run143 along with the same 3 PBS emailed errors in pbs.errors.out. This was using r1453. Upgrading too 1456 now. Just fyi - dont bother with this till we see it with latest release. Also, this run was with kickstart; yesterdays was not. - Mike On 11/4/07 7:45 AM, Michael Wilde wrote: > On 11/4/07 12:18 AM, Ben Clifford wrote: > > >> But the main suspicious thing above is that while the log shows stagin > >> complete for pc1.pcap at 4:52 past the hour, the ls shows the file > mod date to > >> be 4:55 past the hour, while the job was started (queued?) at 4:52. > > > > mod date is 4:52. > > I got my file and job mixed up. The file was pc2.pcap, the mod date was > 4:55, but so was the job start time, so that look ok. My mistake. > > > > > What site? Can you use a different FS? > > > > uc-teragrid. I will experiment with both nfs and gpfs. > Did you determine with Andrew which is faster? More reliable? > > On 11/3/07 11:59 PM, Mihael Hategan wrote: >> Looks to me like a problem with PBS rather than something with the jobs. >> So I don't think this is worth investigating. It belongs to the "random >> bad things happen on occasion" class of problems, for which we have >> restarts and scoring. > > Possibly. In this case the job was re-run twice (3 total, within a > minute) and all three failed, all got the same PBS error message emailed > to me. I agree, not worth investigating unless it happens more. 
> > grep JOB_START a*.log | grep pc2 > > 2007-11-03 19:04:55,814-0600 DEBUG vdl:execute2 JOB_START > jobid=angle4-tjal0lji tr=angle4 arguments=[pc2.pcap, > _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, > _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] > tmpdir=awf3-20071103-1904-2z266pk3/jobs/t/angle4-tjal0lji host=UC > 2007-11-03 19:05:29,495-0600 DEBUG vdl:execute2 JOB_START > jobid=angle4-vjal0lji tr=angle4 arguments=[pc2.pcap, > _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, > _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] > tmpdir=awf3-20071103-1904-2z266pk3/jobs/v/angle4-vjal0lji host=UC > 2007-11-03 19:05:42,678-0600 DEBUG vdl:execute2 JOB_START > jobid=angle4-xjal0lji tr=angle4 arguments=[pc2.pcap, > _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, > _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] > tmpdir=awf3-20071103-1904-2z266pk3/jobs/x/angle4-xjal0lji host=UC > vz$ > > grep EXCEPT a*.log > > 2007-11-03 19:05:28,567-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-tjal0lji - Application exception: No status file was found. > Check the shared filesystem on UC > 2007-11-03 19:05:41,754-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-vjal0lji - Application exception: No status file was found. > Check the shared filesystem on UC > 2007-11-03 19:05:55,048-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-xjal0lji - Application exception: No status file was found. > Check the shared filesystem on UC > > - Mike > >> >> Mihael >> >> On Sat, 2007-11-03 at 23:34 -0500, Michael Wilde wrote: >>> In the angle run in ~benc/swift-logs/wilde/run121, it looks to me >>> like theres a chance that a job attempted to start before its data >>> was visible to the node (not sure, just suspicious). >>> >>> Its a 5-job angle run. 4 jobs worked. The 5th job failed, the one >>> for index 1 of a 5-element input file array. >>> >>> The wf ran with ~wilde/swiftdata/* as the storage and work dir (2 >>> subdirs below that). So it was on NFS. >>> >>> All 5 input files are in the shared/ dir, but the failing job is the >>> one whose timestamp is last. (0, 2,3,4 worked; 1 failed) >>> >>> I also got 3 emails from PBS of the form: >>> >>> PBS Job Id: 1571647.tg-master.uc.teragrid.org >>> Job Name: STDIN >>> Aborted by PBS Server >>> Job cannot be executed >>> See Administrator for help >>> >>> all dated 8:05 PM, three consecutive job ids, *47, 48, 49. >>> >>> Q: Do these email messages indicate that the job was failed by PBS >>> before the app was started, or do these messages indicate a non-zero >>> app exit, eg, if its input file was missing? 
>>> >>> The input files on shared/ were dated: >>> >>> drwxr-xr-x 4 wilde allocate 512 2007-11-03 >>> 20:04:33.000000000 -0500 _concurrent/ >>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 >>> 20:04:52.000000000 -0500 pc1.pcap >>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 >>> 20:04:55.000000000 -0500 pc2.pcap >>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 >>> 20:04:52.000000000 -0500 pc3.pcap >>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 >>> 20:04:47.000000000 -0500 pc4.pcap >>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 >>> 20:04:51.000000000 -0500 pc5.pcap >>> -rw-r--r-- 1 wilde allocate 813 2007-11-03 >>> 20:04:33.000000000 -0500 seq.sh >>> -rw-r--r-- 1 wilde allocate 4848 2007-11-03 >>> 20:04:33.000000000 -0500 wrapper.sh >>> >>> The awf3*.log file shows: >>> >>> 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END >>> file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\ >>> ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ >>> provider=file >>> 2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END >>> jobid=angle4-ujal0lji - Staging in finished >>> 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START >>> jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\ >>> 6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, >>> _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] >>> tmpdir=awf3\ >>> -20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC >>> >>> (Note that the logfile for some reason logs times 1 hour behind???) >>> >>> But the main suspicious thing above is that while the log shows >>> stagin complete for pc1.pcap at 4:52 past the hour, the ls shows the >>> file mod date to be 4:55 past the hour, while the job was started >>> (queued?) at 4:52. >>> >>> If the job happened to hit the PBS queue right at the time PBS was >>> doing a queue poll, it may have started right away, and somehow >>> started before file pc1.pcap was visible to the worker node. Im not >>> sure what if anything in the synchronization prevents this, >>> especially if NFS close-to-open consistency is broken. (Which we are >>> very suspicious of on this site and with Linux NFS in general). >>> >>> Lastly, i've run the identical workflow twice more now, and its >>> worked with no change both times. >>> >>> Any ideas or other explanations for what may have happened here? >>> >>> Also, ideas why the swift log file shows times an hour behind? >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Sun Nov 4 20:53:02 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 02:53:02 +0000 (GMT) Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472E821E.50109@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> <1194152372.10816.2.camel@blabla.mcs.anl.gov> <472DCD05.4070609@mcs.anl.gov> <472E821E.50109@mcs.anl.gov> Message-ID: On Sun, 4 Nov 2007, Michael Wilde wrote: > A very similar error occured a bit ago, tonight. > Its in ~benc/swiftlogs/wilde/run143 not that I can see... benc at terminable:~/swift-logs !1019 $ find . 
-name run\*143 benc at terminable:~/swift-logs !1020 $ -- From hategan at mcs.anl.gov Sun Nov 4 21:07:50 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 04 Nov 2007 21:07:50 -0600 Subject: [Swift-devel] Kickstart runs on localhost are failing In-Reply-To: <472E73E1.9030607@mcs.anl.gov> References: <472E73E1.9030607@mcs.anl.gov> Message-ID: <1194232070.6373.1.camel@blabla.mcs.anl.gov> On Sun, 2007-11-04 at 19:37 -0600, Michael Wilde wrote: > I get job exceptions when I run with kickstart on localhost, > regardless of whether clustered or not. > > The jobs seem to run (3x each) but fail each time. First time gets > "Application exception: Missing argument jobdir", 2nd & 3rd get > "Application exception: The cache already contains > localhost:awf4-20071104-1843-ds8hn11a..." That probably shouldn't happen unless you're trying to assign to the same variable twice. Does this work without kickstart? > > Clustered run is in run137, unclustered in run138 > The latter log dir has a file swiftdata.find.out which lists all the > files in my data dir (has a local/ branch at the top for localhost jobs). > > Error in both cases is below. > > Will try next doing kickstart in both ways via gram. > > - Mike > > 2007-11-04 18:47:40,946-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-cgqcmmji - Application exception: Missing argument jobdir > for sys:element(rhost, wfdir, jobid, jobdir) > 2007-11-04 18:47:41,085-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-2-1194223436415) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-cgqcmmji-stderr.txt not found. > 2007-11-04 18:47:41,344-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-2-1194223436424) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-cgqcmmji-stdout.txt not found. > 2007-11-04 18:47:41,503-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-bgqcmmji - Application exception: Missing argument jobdir > for sys:element(rhost, wfdir, jobid, jobdir) > 2007-11-04 18:47:41,553-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1-1194223436458) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-bgqcmmji-stderr.txt not found. > 2007-11-04 18:47:41,638-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1-1194223436467) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-bgqcmmji-stdout.txt not found. > 2007-11-04 18:47:41,882-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-agqcmmji - Application exception: Missing argument jobdir > for sys:element(rhost, wfdir, jobid, jobdir) > 2007-11-04 18:47:41,987-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-3-1194223436500) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-agqcmmji-stderr.txt not found. > 2007-11-04 18:47:42,047-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-3-1194223436507) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-agqcmmji-stdout.txt not found. > 2007-11-04 18:51:18,439-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-dgqcmmji - Application exception: The cache already > contains localhost:awf4-20071104-1843-ds8hn11a/shared/cf0000.angle. 
> 2007-11-04 18:51:18,628-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-2-1194223436543) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-dgqcmmji-stderr.txt not found. > 2007-11-04 18:51:18,762-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-2-1194223436550) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-dgqcmmji-stdout.txt not found. > 2007-11-04 18:51:25,976-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-egqcmmji - Application exception: The cache already > contains localhost:awf4-20071104-1843-ds8hn11a/shared/of0002.angle. > 2007-11-04 18:51:26,401-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1-1194223436585) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-egqcmmji-stderr.txt not found. > 2007-11-04 18:51:26,726-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1-1194223436592) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-egqcmmji-stdout.txt not found. > 2007-11-04 18:51:28,040-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-fgqcmmji - Application exception: The cache already > contains localhost:awf4-20071104-1843-ds8hn11a/shared/cf0001.angle. > 2007-11-04 18:51:28,492-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-3-1194223436627) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-fgqcmmji-stderr.txt not found. > 2007-11-04 18:51:28,816-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-3-1194223436634) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-fgqcmmji-stdout.txt not found. > 2007-11-04 18:54:44,088-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-hgqcmmji - Application exception: The cache already > contains localhost:awf4-20071104-1843-ds8hn11a/shared/of0002.angle. > 2007-11-04 18:54:44,440-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1-1194223436670) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-hgqcmmji-stderr.txt not found. > 2007-11-04 18:54:44,652-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1-1194223436677) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-hgqcmmji-stdout.txt not found. 
> 2007-11-04 18:54:44,741-0600 DEBUG VDL2ExecutionContext Exception in angle4: > Exception in angle4: > sys:exception @ vdl-int.k, line: 423 > at > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > 2007-11-04 18:54:46,190-0600 INFO ExecutionContext Detailed exception: > Exception in angle4: > sys:exception @ vdl-int.k, line: 423 > at > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Sun Nov 4 21:15:39 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 04 Nov 2007 21:15:39 -0600 Subject: [Swift-devel] Kickstart runs on localhost are failing In-Reply-To: <1194232070.6373.1.camel@blabla.mcs.anl.gov> References: <472E73E1.9030607@mcs.anl.gov> <1194232070.6373.1.camel@blabla.mcs.anl.gov> Message-ID: <1194232539.6373.3.camel@blabla.mcs.anl.gov> On Sun, 2007-11-04 at 21:07 -0600, Mihael Hategan wrote: > On Sun, 2007-11-04 at 19:37 -0600, Michael Wilde wrote: > > I get job exceptions when I run with kickstart on localhost, > > regardless of whether clustered or not. > > > > The jobs seem to run (3x each) but fail each time. First time gets > > "Application exception: Missing argument jobdir", 2nd & 3rd get > > "Application exception: The cache already contains > > localhost:awf4-20071104-1843-ds8hn11a..." > > That probably shouldn't happen unless you're trying to assign to the > same variable twice. Does this work without kickstart? Where "shouldn't" should be interpreted as "unless there's a bug", which isn't necessarily unlikely. > > > > > Clustered run is in run137, unclustered in run138 > > The latter log dir has a file swiftdata.find.out which lists all the > > files in my data dir (has a local/ branch at the top for localhost jobs). > > > > Error in both cases is below. > > > > Will try next doing kickstart in both ways via gram. > > > > - Mike > > > > 2007-11-04 18:47:40,946-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > > jobid=angle4-cgqcmmji - Application exception: Missing argument jobdir > > for sys:element(rhost, wfdir, jobid, jobdir) > > 2007-11-04 18:47:41,085-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > > identity=urn:0-2-1194223436415) setting status to Failed > > org.globus.cog.abstraction.impl.file.FileNotFoundException: > > angle4-cgqcmmji-stderr.txt not found. > > 2007-11-04 18:47:41,344-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > > identity=urn:0-2-1194223436424) setting status to Failed > > org.globus.cog.abstraction.impl.file.FileNotFoundException: > > angle4-cgqcmmji-stdout.txt not found. > > 2007-11-04 18:47:41,503-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > > jobid=angle4-bgqcmmji - Application exception: Missing argument jobdir > > for sys:element(rhost, wfdir, jobid, jobdir) > > 2007-11-04 18:47:41,553-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > > identity=urn:0-1-1194223436458) setting status to Failed > > org.globus.cog.abstraction.impl.file.FileNotFoundException: > > angle4-bgqcmmji-stderr.txt not found. > > 2007-11-04 18:47:41,638-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > > identity=urn:0-1-1194223436467) setting status to Failed > > org.globus.cog.abstraction.impl.file.FileNotFoundException: > > angle4-bgqcmmji-stdout.txt not found. 
> > 2007-11-04 18:54:44,741-0600 DEBUG VDL2ExecutionContext Exception in angle4: > > Exception in angle4: > > sys:exception @ vdl-int.k, line: 423 > > at > > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > > 2007-11-04 18:54:46,190-0600 INFO ExecutionContext Detailed exception: > > Exception in angle4: > > sys:exception @ vdl-int.k, line: 423 > > at > > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Sun Nov 4 21:20:40 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 04 Nov 2007 21:20:40 -0600 Subject: [Swift-devel] Jobs being aborted by PBS server on tg-grid.uc.teragrid.org Message-ID: <472E8C08.308@mcs.anl.gov> Im starting to see more frequent problems like this. Happened once last night to 3 consecutive jobs, and tonight happened twice, to 6 jobs. Ti, could you look in the PBS logs, possibly on the related node(s) and see if its looking like a problem on tg-uc or on our side? Thanks, Mike 11/3 8:05 PM - 3 failures Job IDs 1571647, 48, & 49 11/4 7:46 PM - 3 failures Job IDs 1572031, 33, & 34 11/4 8:56 - 8:57 PM 1572040, 42, 43 All errors have the format below. Swift retries failing jobs 3 times, hence the groups of 3 above. -------- Original Message -------- Subject: PBS JOB 1572043.tg-master.uc.teragrid.org Date: Sun, 4 Nov 2007 20:57:11 -0600 (CST) From: adm at tg-master.uc.teragrid.org (root) To: wilde at tg-grid1.uc.teragrid.org PBS Job Id: 1572043.tg-master.uc.teragrid.org Job Name: STDIN Aborted by PBS Server Job cannot be executed See Administrator for help From wilde at mcs.anl.gov Sun Nov 4 21:26:17 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 04 Nov 2007 21:26:17 -0600 Subject: [Swift-devel] Kickstart runs on localhost are failing In-Reply-To: <1194232070.6373.1.camel@blabla.mcs.anl.gov> References: <472E73E1.9030607@mcs.anl.gov> <1194232070.6373.1.camel@blabla.mcs.anl.gov> Message-ID: <472E8D59.8020206@mcs.anl.gov> [resending to cc swift-devel] On 11/4/07 9:07 PM, Mihael Hategan wrote: > On Sun, 2007-11-04 at 19:37 -0600, Michael Wilde wrote: >> I get job exceptions when I run with kickstart on localhost, >> regardless of whether clustered or not. >> >> The jobs seem to run (3x each) but fail each time. First time gets >> "Application exception: Missing argument jobdir", 2nd & 3rd get >> "Application exception: The cache already contains >> localhost:awf4-20071104-1843-ds8hn11a..." > > That probably shouldn't happen unless you're trying to assign to the > same variable twice. Does this work without kickstart? Yes, it works without kickstart (r1453) Trying again on r1456. It looked to me like the "cache already contains" error was a result of the first failure (which Ben thinks he's fixed in 1456 if I understand right) leaving the cache in a state where the retry gets confused. I should note that in all these cases, I got all the output, so the job runs despite the first error, likely causing the duplicate cache entry problems. 
- Mike
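[Editor's note: below is a minimal, hypothetical sketch of the failure mode Mike describes above. The class and method names are invented and are not the actual Swift/Karajan cache API; the point is only that if a per-site cache rejects duplicate host:path entries, a retry of a job whose outputs were already registered by a partially failed first attempt will trip over the stale entries and report "The cache already contains ...".]

import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-site file cache keyed by host:path.
public class FileCacheSketch {
    private final Map<String, String> entries = new HashMap<String, String>();

    // Registers a staged-out file; rejects duplicates.
    public synchronized void add(String host, String path) {
        String key = host + ":" + path;
        if (entries.containsKey(key)) {
            throw new IllegalStateException("The cache already contains " + key);
        }
        entries.put(key, path);
    }

    public static void main(String[] args) {
        FileCacheSketch cache = new FileCacheSketch();
        // First attempt: the application succeeds and its output is cached,
        // but a later step (imagine the kickstart record fetch) fails and the
        // whole job attempt is marked as failed.
        cache.add("localhost", "awf4-20071104-1843-ds8hn11a/shared/cf0000.angle");
        // Retry: the same output is registered again and collides with the
        // entry left behind by the first attempt.
        try {
            cache.add("localhost", "awf4-20071104-1843-ds8hn11a/shared/cf0000.angle");
        } catch (IllegalStateException e) {
            System.out.println("Application exception: " + e.getMessage());
        }
    }
}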
From hategan at mcs.anl.gov Sun Nov 4 21:32:04 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 04 Nov 2007 21:32:04 -0600
Subject: [Swift-devel] Kickstart runs on localhost are failing
In-Reply-To: <472E8D59.8020206@mcs.anl.gov>
References: <472E73E1.9030607@mcs.anl.gov> <1194232070.6373.1.camel@blabla.mcs.anl.gov> <472E8D59.8020206@mcs.anl.gov>
Message-ID: <1194233524.7242.3.camel@blabla.mcs.anl.gov>

On Sun, 2007-11-04 at 21:26 -0600, Michael Wilde wrote:
> [resending to cc swift-devel]
>
> On 11/4/07 9:07 PM, Mihael Hategan wrote:
> > On Sun, 2007-11-04 at 19:37 -0600, Michael Wilde wrote:
> >> I get job exceptions when I run with kickstart on localhost,
> >> regardless of whether clustered or not.
> >>
> >> The jobs seem to run (3x each) but fail each time. First time gets
> >> "Application exception: Missing argument jobdir", 2nd & 3rd get
> >> "Application exception: The cache already contains
> >> localhost:awf4-20071104-1843-ds8hn11a..."
> >
> > That probably shouldn't happen unless you're trying to assign to the
> > same variable twice. Does this work without kickstart?
>
> Yes, it works without kickstart (r1453)
> Trying again on r1456.
>
> It looked to me like the "cache already contains" error was a result of
> the first failure (which Ben thinks he's fixed in 1456 if I understand
> right) leaving the cache in a state where the retry gets confused.

I thought I made sure in some r that things are added to the cache
transactionally (i.e. when it's known that no bad things can happen).
Maybe I got something wrong.

>
> I should note that in all these cases, I got all the output, so the job
> runs despite the first error, likely causing the duplicate cache entry
> problems.

Ah, I see. The failure occurs when dealing with kickstart which is after
the files are added to the cache. I did get something wrong.
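[Editor's note: a sketch of the "transactional" ordering being discussed here: add the stage-out files to the cache only after every step that can still fail, including the optional kickstart-record transfer, has completed. This is hypothetical Java for illustration, not the actual execute2/vdl-int.k logic; all names below are invented.]

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: if cache registration is the last action in the
// post-job sequence, a failure in an optional step leaves no stale
// entries behind, and the retry starts from a clean cache.
public class PostJobOrderingSketch {
    interface Step { void run() throws Exception; }

    static void postJob(List<Step> steps, Step registerOutputsInCache) throws Exception {
        for (Step s : steps) {
            s.run();                    // stage out files, fetch kickstart record, ...
        }
        registerOutputsInCache.run();   // reached only if nothing above failed
    }

    public static void main(String[] args) {
        List<Step> steps = new ArrayList<Step>();
        steps.add(new Step() { public void run() { System.out.println("stage out of0002.angle"); } });
        steps.add(new Step() { public void run() throws Exception {
            throw new Exception("kickstart record not found");   // the optional step fails
        } });
        try {
            postJob(steps, new Step() { public void run() { System.out.println("add outputs to site cache"); } });
        } catch (Exception e) {
            // The attempt fails, but the cache was never touched, so the retry
            // does not hit "The cache already contains ...".
            System.out.println("attempt failed: " + e.getMessage());
        }
    }
}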
From hategan at mcs.anl.gov Sun Nov 4 21:37:17 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 04 Nov 2007 21:37:17 -0600
Subject: [Swift-devel] Kickstart runs on localhost are failing
In-Reply-To: <1194233524.7242.3.camel@blabla.mcs.anl.gov>
References: <472E73E1.9030607@mcs.anl.gov> <1194232070.6373.1.camel@blabla.mcs.anl.gov> <472E8D59.8020206@mcs.anl.gov> <1194233524.7242.3.camel@blabla.mcs.anl.gov>
Message-ID: <1194233838.7242.8.camel@blabla.mcs.anl.gov>

> >
> > It looked to me like the "cache already contains" error was a result of
> > the first failure (which Ben thinks he's fixed in 1456 if I understand
> > right) leaving the cache in a state where the retry gets confused.
>
> I thought I made sure in some r that things are added to the cache
> transactionally (i.e. when it's known that no bad things can happen).
> Maybe I got something wrong.
>
> >
> > I should note that in all these cases, I got all the output, so the job
> > runs despite the first error, likely causing the duplicate cache entry
> > problems.
>
> Ah, I see. The failure occurs when dealing with kickstart which is after
> the files are added to the cache. I did get something wrong.

One solution would be to make kickstart transfer failure warnings
instead of them being thrown as exceptions (easy).
The other would be to only add the stageout files to the cache as the
last thing in the execute2 big try block. (very slightly harder).

Let me know which one you want.

Mihael
From wilde at mcs.anl.gov Sun Nov 4 21:38:44 2007
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 04 Nov 2007 21:38:44 -0600
Subject: [Swift-devel] Jobs being aborted by PBS server on tg-grid.uc.teragrid.org
In-Reply-To: <472E8C08.308@mcs.anl.gov>
References: <472E8C08.308@mcs.anl.gov>
Message-ID: <472E9044.9080600@mcs.anl.gov>

Ive reported this to TG and Ti on the chance that its on the server side.
If nothing else, possibly a PBS log can pinpoint what we're doing wrong if its us or me.

The two runs below are in ~benc/swift-logs/wilde/
7:46 PM - run142
8:57 PM - run142

Ive started to add a 'comment' file to my log dirs there to note the reason, and on occasion I copy output placed in cwd to _output. Also adding find or ls output to each dir when its relevant and I remember. Im trying to automate more of this as I go.

- Mike

On 11/4/07 9:20 PM, Michael Wilde wrote:
> Im starting to see more frequent problems like this.
> Happened once last night to 3 consecutive jobs, and tonight happened
> twice, to 6 jobs.
>
> Ti, could you look in the PBS logs, possibly on the related node(s) and
> see if its looking like a problem on tg-uc or on our side?
>
> Thanks,
>
> Mike
>
> 11/3 8:05 PM - 3 failures
> Job IDs 1571647, 48, & 49
> 11/4 7:46 PM - 3 failures
> Job IDs 1572031, 33, & 34
> 11/4 8:56 - 8:57 PM
> 1572040, 42, 43
>
> All errors have the format below.
>
> Swift retries failing jobs 3 times, hence the groups of 3 above.
> > > -------- Original Message -------- > Subject: PBS JOB 1572043.tg-master.uc.teragrid.org > Date: Sun, 4 Nov 2007 20:57:11 -0600 (CST) > From: adm at tg-master.uc.teragrid.org (root) > To: wilde at tg-grid1.uc.teragrid.org > > PBS Job Id: 1572043.tg-master.uc.teragrid.org > Job Name: STDIN > Aborted by PBS Server > Job cannot be executed > See Administrator for help > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Sun Nov 4 21:41:20 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 03:41:20 +0000 (GMT) Subject: [Swift-devel] Kickstart runs on localhost are failing In-Reply-To: <1194233838.7242.8.camel@blabla.mcs.anl.gov> References: <472E73E1.9030607@mcs.anl.gov> <1194232070.6373.1.camel@blabla.mcs.anl.gov> <472E8D59.8020206@mcs.anl.gov> <1194233524.7242.3.camel@blabla.mcs.anl.gov> <1194233838.7242.8.camel@blabla.mcs.anl.gov> Message-ID: On Sun, 4 Nov 2007, Mihael Hategan wrote: > > Ah, I see. The failure occurs when dealing with kickstart which is after > > the files are added to the cache. I did get something wrong. > > One solution would be to make kickstart transfer failure warnings > instead of them being thrown as exceptions (easy). > The other would be to only add the stageout files to the cache as the > last thing in the execute2 big try block. (very slightly harder). I think warnings are preferable. -- From hategan at mcs.anl.gov Sun Nov 4 21:48:21 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 04 Nov 2007 21:48:21 -0600 Subject: [Swift-devel] Kickstart runs on localhost are failing In-Reply-To: References: <472E73E1.9030607@mcs.anl.gov> <1194232070.6373.1.camel@blabla.mcs.anl.gov> <472E8D59.8020206@mcs.anl.gov> <1194233524.7242.3.camel@blabla.mcs.anl.gov> <1194233838.7242.8.camel@blabla.mcs.anl.gov> Message-ID: <1194234501.7815.1.camel@blabla.mcs.anl.gov> On Mon, 2007-11-05 at 03:41 +0000, Ben Clifford wrote: > > On Sun, 4 Nov 2007, Mihael Hategan wrote: > > > > Ah, I see. The failure occurs when dealing with kickstart which is after > > > the files are added to the cache. I did get something wrong. > > > > One solution would be to make kickstart transfer failure warnings > > instead of them being thrown as exceptions (easy). > > > The other would be to only add the stageout files to the cache as the > > last thing in the execute2 big try block. (very slightly harder). > > I think warnings are preferable. Done (r1457). I have about 85% confidence that it will work as intended. > From bugzilla-daemon at mcs.anl.gov Sun Nov 4 21:59:17 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sun, 4 Nov 2007 21:59:17 -0600 (CST) Subject: [Swift-devel] [Bug 36] maxwalltime specs In-Reply-To: Message-ID: <20071105035917.340D9164BC@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=36 hategan at mcs.anl.gov changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from hategan at mcs.anl.gov 2007-11-04 21:59 ------- Closing due to lack of further complaints after fix. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. 
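[Editor's note: a sketch of the option Ben preferred and that Mihael reports implementing in r1457 in the thread above: demote a failed transfer of the optional kickstart record to a warning instead of letting it fail the job. This is hypothetical Java for illustration only; it is not the r1457 change itself, and the names below are invented.]

import java.io.FileNotFoundException;
import java.util.logging.Logger;

// Hypothetical sketch of "warn, don't fail" for optional diagnostics:
// the job's real outputs have already been handled, so a missing
// kickstart record should not cause the whole job to be retried.
public class KickstartStageOutSketch {
    private static final Logger LOG = Logger.getLogger("swift.sketch");

    // Stand-in for transferring the kickstart record back from the site.
    static void fetchKickstartRecord(String jobid) throws FileNotFoundException {
        throw new FileNotFoundException(jobid + " kickstart record not found.");
    }

    public static void postProcess(String jobid) {
        try {
            fetchKickstartRecord(jobid);
        } catch (FileNotFoundException e) {
            // In this sketch the failure is only logged as a warning,
            // rather than being rethrown and failing the job.
            LOG.warning("Could not retrieve kickstart record for " + jobid + ": " + e.getMessage());
        }
        // continue: register outputs, mark the job successful, etc.
    }

    public static void main(String[] args) {
        postProcess("angle4-dgqcmmji");
    }
}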
From bugzilla-daemon at mcs.anl.gov Mon Nov 5 07:24:29 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 5 Nov 2007 07:24:29 -0600 (CST) Subject: [Swift-devel] [Bug 111] New: stage out -info and cluster logs in the same fashion as kickstart records. Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=111 Summary: stage out -info and cluster logs in the same fashion as kickstart records. Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: General AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk CC: swift-devel at ci.uchicago.edu make staging of info, cluster and kickstart records consistent - at present, there are unnecessarily different ways of getting at them. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From benc at hawaga.org.uk Mon Nov 5 07:25:20 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 13:25:20 +0000 (GMT) Subject: [Swift-devel] Kickstart on Angle vs not? In-Reply-To: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry> References: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry> Message-ID: On Sat, 3 Nov 2007, Ian Foster wrote: > we use a different mechanbism to retrieve kickstart ouput vs. Log file > output, it seems. I'd be interested to understand them. Given that the info records have been useful at least once, its probably useful to treat those, the kickstart records and the cluster logs all in the same fashion. I put that in the bugzilla as bug 111. -- From wilde at mcs.anl.gov Mon Nov 5 07:48:44 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 07:48:44 -0600 Subject: [Swift-devel] Kickstart on Angle vs not? In-Reply-To: References: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry> Message-ID: <472F1F3C.6030004@mcs.anl.gov> On 11/5/07 7:25 AM, Ben Clifford wrote: > > On Sat, 3 Nov 2007, Ian Foster wrote: > >> we use a different mechanbism to retrieve kickstart ouput vs. Log file >> output, it seems. I'd be interested to understand them. > > Given that the info records have been useful at least once, its probably > useful to treat those, the kickstart records and the cluster logs all in > the same fashion. Agreed. Should have an option to bring all (if feasible) or some (if too large) of this info back to submit host and store in log repository. An optional ls -lR of the server tree is often helpful to add. > I put that in the bugzilla as bug 111. From benc at hawaga.org.uk Mon Nov 5 07:50:57 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 13:50:57 +0000 (GMT) Subject: [Swift-devel] Kickstart on Angle vs not? In-Reply-To: <472F1F3C.6030004@mcs.anl.gov> References: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry> <472F1F3C.6030004@mcs.anl.gov> Message-ID: On Mon, 5 Nov 2007, Michael Wilde wrote: > An optional ls -lR of the server tree is often helpful to add. Though touching directories on GPFS is likely to be expensive, I think - both for the job itself and for jobs simultaneouly running on other nodes. 
-- From wilde at mcs.anl.gov Mon Nov 5 09:27:24 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 09:27:24 -0600 Subject: [Swift-devel] How best to distribute named input and outut files across dirs? In-Reply-To: References: Message-ID: <472F365C.7060907@mcs.anl.gov> Whats the best way to spread output files across a directory if they are mapped, as opposed to anonymous? In awf2.swift the outputs went into a single big dir (below _concurrent) because they are neither mapped nor members of an array. In awf3.swift I switched to an array, and they were nicely (albeit verbosely ;) mapped to an array structure automatically. In awf4.swift I name the outputs, and the files are now nicely named but all reside back in the client submit directory. Now I want to make awf5, and spread named inputs and outputs across dirs. I recall suggesting a way to do this to Andrew, but didint track how he and you did it, Ben. Andrew, can you send me your latest swift code? Ben, Mihael, is the best way to do this to manually spread the inputs across a dirs, and map both the inputs and outputs using readdata? angleinput/{00 through 99}/pcNNNN.pcap angleout/{00 through 99}/ofNNNN.angle,cfNNNN.center} I need to focus on a few admin things for a bit, but any/all advice is welcome. :::::::::::::: awf2.swift :::::::::::::: type pcapfile; type angleout; type anglecenter; (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) { app { angle4 @ifile @ofile @cfile; } } pcapfile pcapfiles[]; foreach pf in pcapfiles { angleout of; anglecenter cf; (of,cf) = angle4(pf); } :::::::::::::: awf3.swift :::::::::::::: type pcapfile; type angleout; type anglecenter; (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) { app { angle4 @ifile @ofile @cfile; } } pcapfile pcapfiles[]; angleout of[]; anglecenter cf[]; foreach pf,i in pcapfiles { (of[i],cf[i]) = angle4(pf); } :::::::::::::: awf4.swift :::::::::::::: type pcapfile; type angleout; type anglecenter; (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) { app { angle4 @ifile @ofile @cfile; } } pcapfile pcapfiles[]; angleout of[] ; anglecenter cf[] ; // note i used .angle for both in current tests... foreach pf,i in pcapfiles { (of[i],cf[i]) = angle4(pf); } On 11/1/07 11:57 AM, Ben Clifford wrote: > I just modified the way that ConcurrentMapper lays out files (r1437) > > You will likely not have encountered ConcurrentMapper by name. It is used > when you do not specify a mapper for a dataset, for example for > intermediate variables. > > Previously, all files named by this mapper were given a long name in the > root directory of the submit and cache directories. > > When a large number of files were named in this fashion, for example in an > array with thousands of elements, this would result in a file for each > element and a root directory with thousands of files. > > Most immediately I encountered this problem working with Andrew Jamieson > running on TeraPort using GPFS. Many hosts attempting to access one > directory is severely unscalable on GPFS. > > The changes I have made add more structure to filenames generated by the > ConcurrentMapper: > > > 1. All files appear in a _concurrent/ subdirectory. > > > 2. Simple/marker data typed files appear directly below _concurrent, > named as before. For example: > > file outfile; > > might give a filename: > > _concurrent//outfile-3339612a-08e1-443d-bd14-2329080d2d94- > > > 3. Structures are mapped to a sub-directory, with each element being a > file in that subdirectory. 
For example, > > type footype { file left; file right; } > footype structurefile; > > might give a directory: > > _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field > > containing two files: > > _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/left > _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/right > > > 4. Array elements are placed in a subdirectory. Within that subdirectory, > the index is using to construct a further hierarchy such that there will > never be more than 50 directories/files in any one directory. For example: > > file manyfile[]; > > might give mappings like this: > > myfile[0] stored in: > _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-0 > > myfile[22] stored in: > _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-22 > > myfile[30] stored in: > _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/h5/elt-30 > > myfile[734] stored in: > _concurrent//manyfile-bcdeedee-4df7-4d21-a207-d8051da3d133--array/h9/h4/elt-734 > > To form the paths, basically something like this happens: > convert each number into base 25. discard the most significant digit. > then starting at the least significant digit and working towards > the most significant digit, make that digit into a subdirectory. > > For example, 734 in base 10 is (1) (4) (9) in base 25 > > so we form intermediate path /h9/h4/ > > Doing this means that for large arrays directory paths will grow, whilst > for small arrays will be short; and the size of the array does not need to > be known ahead of time. > > The constant '25' can easily be adjusted. Its a compiled-in constant > defined in one place at the moment, but could be made into a mapper > parameter. > From andrewj at uchicago.edu Mon Nov 5 09:55:12 2007 From: andrewj at uchicago.edu (Andrew Robert Jamieson) Date: Mon, 5 Nov 2007 09:55:12 -0600 (CST) Subject: [Swift-devel] How best to distribute named input and outut files across dirs? In-Reply-To: <472F365C.7060907@mcs.anl.gov> References: <472F365C.7060907@mcs.anl.gov> Message-ID: Hey Mike and others, I used that splitting bash script to separate the files into subdirectories. Then I used that other script you helped me with to find where I put those files. This script generated the .csv which was then read by the csv mapper. Nothing fancy. -Andrew On Mon, 5 Nov 2007, Michael Wilde wrote: > Whats the best way to spread output files across a directory if they are > mapped, as opposed to anonymous? > > In awf2.swift the outputs went into a single big dir (below _concurrent) > because they are neither mapped nor members of an array. > > In awf3.swift I switched to an array, and they were nicely (albeit verbosely > ;) mapped to an array structure automatically. > > In awf4.swift I name the outputs, and the files are now nicely named but all > reside back in the client submit directory. > > Now I want to make awf5, and spread named inputs and outputs across dirs. I > recall suggesting a way to do this to Andrew, but didint track how he and you > did it, Ben. > > Andrew, can you send me your latest swift code? > > Ben, Mihael, is the best way to do this to manually spread the inputs across > a dirs, and map both the inputs and outputs using readdata? > > angleinput/{00 through 99}/pcNNNN.pcap > > angleout/{00 through 99}/ofNNNN.angle,cfNNNN.center} > > I need to focus on a few admin things for a bit, but any/all advice is > welcome. 
> > > > :::::::::::::: > awf2.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > foreach pf in pcapfiles { > angleout of; > anglecenter cf; > (of,cf) = angle4(pf); > } > :::::::::::::: > awf3.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > angleout of[]; > anglecenter cf[]; > > foreach pf,i in pcapfiles { > (of[i],cf[i]) = angle4(pf); > } > :::::::::::::: > awf4.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > angleout of[] ; > anglecenter cf[] ; > // note i used .angle for both in current tests... > > foreach pf,i in pcapfiles { > (of[i],cf[i]) = angle4(pf); > } > > > > On 11/1/07 11:57 AM, Ben Clifford wrote: >> I just modified the way that ConcurrentMapper lays out files (r1437) >> >> You will likely not have encountered ConcurrentMapper by name. It is used >> when you do not specify a mapper for a dataset, for example for >> intermediate variables. >> >> Previously, all files named by this mapper were given a long name in the >> root directory of the submit and cache directories. >> >> When a large number of files were named in this fashion, for example in an >> array with thousands of elements, this would result in a file for each >> element and a root directory with thousands of files. >> >> Most immediately I encountered this problem working with Andrew Jamieson >> running on TeraPort using GPFS. Many hosts attempting to access one >> directory is severely unscalable on GPFS. >> >> The changes I have made add more structure to filenames generated by the >> ConcurrentMapper: >> >> >> 1. All files appear in a _concurrent/ subdirectory. >> >> >> 2. Simple/marker data typed files appear directly below _concurrent, named >> as before. For example: >> >> file outfile; >> >> might give a filename: >> >> _concurrent//outfile-3339612a-08e1-443d-bd14-2329080d2d94- >> >> >> 3. Structures are mapped to a sub-directory, with each element being a >> file in that subdirectory. For example, >> >> type footype { file left; file right; } >> footype structurefile; >> >> might give a directory: >> >> _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field >> >> containing two files: >> >> _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/left >> _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/right >> >> >> 4. Array elements are placed in a subdirectory. Within that subdirectory, >> the index is using to construct a further hierarchy such that there will >> never be more than 50 directories/files in any one directory. 
For example: >> >> file manyfile[]; >> >> might give mappings like this: >> >> myfile[0] stored in: >> _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-0 >> >> myfile[22] stored in: >> _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-22 >> >> myfile[30] stored in: >> _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/h5/elt-30 >> >> myfile[734] stored in: >> _concurrent//manyfile-bcdeedee-4df7-4d21-a207-d8051da3d133--array/h9/h4/elt-734 >> >> To form the paths, basically something like this happens: >> convert each number into base 25. discard the most significant digit. then >> starting at the least significant digit and working towards the most >> significant digit, make that digit into a subdirectory. >> >> For example, 734 in base 10 is (1) (4) (9) in base 25 >> >> so we form intermediate path /h9/h4/ >> >> Doing this means that for large arrays directory paths will grow, whilst >> for small arrays will be short; and the size of the array does not need to >> be known ahead of time. >> >> The constant '25' can easily be adjusted. Its a compiled-in constant >> defined in one place at the moment, but could be made into a mapper >> parameter. >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From andrewj at uchicago.edu Mon Nov 5 09:58:08 2007 From: andrewj at uchicago.edu (Andrew Robert Jamieson) Date: Mon, 5 Nov 2007 09:58:08 -0600 (CST) Subject: [Swift-devel] How best to distribute named input and outut files across dirs? In-Reply-To: <472F365C.7060907@mcs.anl.gov> References: <472F365C.7060907@mcs.anl.gov> Message-ID: Latest swift code should be in ~/CADGrid/Swifty/ or ~/CADGrid/Swifty/skynet-swift-runs/ On Mon, 5 Nov 2007, Michael Wilde wrote: > Whats the best way to spread output files across a directory if they are > mapped, as opposed to anonymous? > > In awf2.swift the outputs went into a single big dir (below _concurrent) > because they are neither mapped nor members of an array. > > In awf3.swift I switched to an array, and they were nicely (albeit verbosely > ;) mapped to an array structure automatically. > > In awf4.swift I name the outputs, and the files are now nicely named but all > reside back in the client submit directory. > > Now I want to make awf5, and spread named inputs and outputs across dirs. I > recall suggesting a way to do this to Andrew, but didint track how he and you > did it, Ben. > > Andrew, can you send me your latest swift code? > > Ben, Mihael, is the best way to do this to manually spread the inputs across > a dirs, and map both the inputs and outputs using readdata? > > angleinput/{00 through 99}/pcNNNN.pcap > > angleout/{00 through 99}/ofNNNN.angle,cfNNNN.center} > > I need to focus on a few admin things for a bit, but any/all advice is > welcome. 
> > > > :::::::::::::: > awf2.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > foreach pf in pcapfiles { > angleout of; > anglecenter cf; > (of,cf) = angle4(pf); > } > :::::::::::::: > awf3.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > angleout of[]; > anglecenter cf[]; > > foreach pf,i in pcapfiles { > (of[i],cf[i]) = angle4(pf); > } > :::::::::::::: > awf4.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > angleout of[] ; > anglecenter cf[] ; > // note i used .angle for both in current tests... > > foreach pf,i in pcapfiles { > (of[i],cf[i]) = angle4(pf); > } > > > > On 11/1/07 11:57 AM, Ben Clifford wrote: >> I just modified the way that ConcurrentMapper lays out files (r1437) >> >> You will likely not have encountered ConcurrentMapper by name. It is used >> when you do not specify a mapper for a dataset, for example for >> intermediate variables. >> >> Previously, all files named by this mapper were given a long name in the >> root directory of the submit and cache directories. >> >> When a large number of files were named in this fashion, for example in an >> array with thousands of elements, this would result in a file for each >> element and a root directory with thousands of files. >> >> Most immediately I encountered this problem working with Andrew Jamieson >> running on TeraPort using GPFS. Many hosts attempting to access one >> directory is severely unscalable on GPFS. >> >> The changes I have made add more structure to filenames generated by the >> ConcurrentMapper: >> >> >> 1. All files appear in a _concurrent/ subdirectory. >> >> >> 2. Simple/marker data typed files appear directly below _concurrent, named >> as before. For example: >> >> file outfile; >> >> might give a filename: >> >> _concurrent//outfile-3339612a-08e1-443d-bd14-2329080d2d94- >> >> >> 3. Structures are mapped to a sub-directory, with each element being a >> file in that subdirectory. For example, >> >> type footype { file left; file right; } >> footype structurefile; >> >> might give a directory: >> >> _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field >> >> containing two files: >> >> _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/left >> _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/right >> >> >> 4. Array elements are placed in a subdirectory. Within that subdirectory, >> the index is using to construct a further hierarchy such that there will >> never be more than 50 directories/files in any one directory. 
For example: >> >> file manyfile[]; >> >> might give mappings like this: >> >> myfile[0] stored in: >> _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-0 >> >> myfile[22] stored in: >> _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-22 >> >> myfile[30] stored in: >> _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/h5/elt-30 >> >> myfile[734] stored in: >> _concurrent//manyfile-bcdeedee-4df7-4d21-a207-d8051da3d133--array/h9/h4/elt-734 >> >> To form the paths, basically something like this happens: >> convert each number into base 25. discard the most significant digit. then >> starting at the least significant digit and working towards the most >> significant digit, make that digit into a subdirectory. >> >> For example, 734 in base 10 is (1) (4) (9) in base 25 >> >> so we form intermediate path /h9/h4/ >> >> Doing this means that for large arrays directory paths will grow, whilst >> for small arrays will be short; and the size of the array does not need to >> be known ahead of time. >> >> The constant '25' can easily be adjusted. Its a compiled-in constant >> defined in one place at the moment, but could be made into a mapper >> parameter. >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Nov 5 10:54:03 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 16:54:03 +0000 (GMT) Subject: [Swift-devel] Re: How best to distribute named input and outut files across dirs? In-Reply-To: <472F365C.7060907@mcs.anl.gov> References: <472F365C.7060907@mcs.anl.gov> Message-ID: I'm confused by your use of the concurrent mapper with the word 'output' - anything appearing under _concurrent is rather arbitrarily named. For inputs, how are you specifying input mapping at the moment? Can you give the mapper declaration you use for inputs? For outputs, some ideas: i) explicitly map output paths using the CSV mapper or execution mapper. ii) write a custom mapper or have one of us do it that has more hierarchical behaviour. On Mon, 5 Nov 2007, Michael Wilde wrote: > Whats the best way to spread output files across a directory if they are > mapped, as opposed to anonymous? > > In awf2.swift the outputs went into a single big dir (below _concurrent) > because they are neither mapped nor members of an array. > > In awf3.swift I switched to an array, and they were nicely (albeit verbosely > ;) mapped to an array structure automatically. > > In awf4.swift I name the outputs, and the files are now nicely named but all > reside back in the client submit directory. > > Now I want to make awf5, and spread named inputs and outputs across dirs. I > recall suggesting a way to do this to Andrew, but didint track how he and you > did it, Ben. > > Andrew, can you send me your latest swift code? > > Ben, Mihael, is the best way to do this to manually spread the inputs across a > dirs, and map both the inputs and outputs using readdata? > > angleinput/{00 through 99}/pcNNNN.pcap > > angleout/{00 through 99}/ofNNNN.angle,cfNNNN.center} > > I need to focus on a few admin things for a bit, but any/all advice is > welcome. 
> > > > :::::::::::::: > awf2.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > foreach pf in pcapfiles { > angleout of; > anglecenter cf; > (of,cf) = angle4(pf); > } > :::::::::::::: > awf3.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > angleout of[]; > anglecenter cf[]; > > foreach pf,i in pcapfiles { > (of[i],cf[i]) = angle4(pf); > } > :::::::::::::: > awf4.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > angleout of[] ; > anglecenter cf[] ; > // note i used .angle for both in current tests... > > foreach pf,i in pcapfiles { > (of[i],cf[i]) = angle4(pf); > } > > > > On 11/1/07 11:57 AM, Ben Clifford wrote: > > I just modified the way that ConcurrentMapper lays out files (r1437) > > > > You will likely not have encountered ConcurrentMapper by name. It is used > > when you do not specify a mapper for a dataset, for example for intermediate > > variables. > > > > Previously, all files named by this mapper were given a long name in the > > root directory of the submit and cache directories. > > > > When a large number of files were named in this fashion, for example in an > > array with thousands of elements, this would result in a file for each > > element and a root directory with thousands of files. > > > > Most immediately I encountered this problem working with Andrew Jamieson > > running on TeraPort using GPFS. Many hosts attempting to access one > > directory is severely unscalable on GPFS. > > > > The changes I have made add more structure to filenames generated by the > > ConcurrentMapper: > > > > > > 1. All files appear in a _concurrent/ subdirectory. > > > > > > 2. Simple/marker data typed files appear directly below _concurrent, named > > as before. For example: > > > > file outfile; > > > > might give a filename: > > > > _concurrent//outfile-3339612a-08e1-443d-bd14-2329080d2d94- > > > > > > 3. Structures are mapped to a sub-directory, with each element being a file > > in that subdirectory. For example, > > > > type footype { file left; file right; } > > footype structurefile; > > > > might give a directory: > > > > _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field > > > > containing two files: > > > > _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/left > > _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/right > > > > > > 4. Array elements are placed in a subdirectory. Within that subdirectory, > > the index is using to construct a further hierarchy such that there will > > never be more than 50 directories/files in any one directory. 
For example: > > > > file manyfile[]; > > > > might give mappings like this: > > > > myfile[0] stored in: > > _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-0 > > > > myfile[22] stored in: > > _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-22 > > > > myfile[30] stored in: > > _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/h5/elt-30 > > > > myfile[734] stored in: > > _concurrent//manyfile-bcdeedee-4df7-4d21-a207-d8051da3d133--array/h9/h4/elt-734 > > > > To form the paths, basically something like this happens: > > convert each number into base 25. discard the most significant digit. then > > starting at the least significant digit and working towards the most > > significant digit, make that digit into a subdirectory. > > > > For example, 734 in base 10 is (1) (4) (9) in base 25 > > > > so we form intermediate path /h9/h4/ > > > > Doing this means that for large arrays directory paths will grow, whilst for > > small arrays will be short; and the size of the array does not need to be > > known ahead of time. > > > > The constant '25' can easily be adjusted. Its a compiled-in constant defined > > in one place at the moment, but could be made into a mapper parameter. > > > > From benc at hawaga.org.uk Mon Nov 5 11:06:07 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 17:06:07 +0000 (GMT) Subject: [Swift-devel] Re: How best to distribute named input and outut files across dirs? In-Reply-To: <472F365C.7060907@mcs.anl.gov> References: <472F365C.7060907@mcs.anl.gov> Message-ID: On Mon, 5 Nov 2007, Michael Wilde wrote: > In awf3.swift I switched to an array, and they were nicely (albeit verbosely > ;) mapped to an array structure automatically. > > In awf4.swift I name the outputs, and the files are now nicely named but all > reside back in the client submit directory. > > Now I want to make awf5, and spread named inputs and outputs across dirs. I > recall suggesting a way to do this to Andrew, but didint track how he and you > did it, Ben. If you're explicitly naming ouputs in awf4, you can explicitly name them in awf5 too. Put '/' symbols in the filenames to indicate directory cuts, like in URIs or filenames. -- From wilde at mcs.anl.gov Mon Nov 5 18:46:00 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 18:46:00 -0600 Subject: [Swift-devel] slow swift startup time Message-ID: <472FB948.8010605@mcs.anl.gov> Im running on tg-login.uc.teragrid.org (IA64) and seeing very long swift start times. My swift command wrapper prints the wf start and end times with the swift stdout sandwiched in between. Here's an example of those, followed by the swift log file. In this run, i start swift in the background, then tail the stdout file. it was about 70 seconds (on my watch) before swift responded with its initial messages on stdout. (I dont think its being buffered, but thats worth checking...) Note that swift was launched at 18:30:49 and its logfile entry with the runid came at 18:32:05. 32:05-30:49 = 76 seconds! This was swift 1456 compiled on terminable (or login, i forget). Suspicious: when I was running a version compiled in tg-login under Java 1.4 I would get an error message from a Java method trying to lock the log file. Not sure if this logging action (which now does not give a message) is related to this slow start time. 
- Mike UC64$ cat swift.out Swift Script local-noks.swift starting at Mon Nov 5 18:30:49 CST 2007 Swift v0.3-dev r1456 RunID: 20071105-1831-d7t5l2n3 angle4 started angle4 started angle4 started angle4 started angle4 started angle4 completed angle4 completed angle4 completed angle4 completed angle4 completed Swift Script local-noks.swift ended at Mon Nov 5 18:32:34 CST 2007 with exit code 0 UC64$ head awf*.log 2007-11-05 18:31:09,566-0600 INFO Loader awf6.swift: source file is new. Recompiling. 2007-11-05 18:31:30,454-0600 INFO Karajan Validation of XML intermediate file was successful 2007-11-05 18:31:55,465-0600 INFO unknown Using sites file: ./sites.xml 2007-11-05 18:31:55,466-0600 INFO unknown Using tc.data: ./tc.data 2007-11-05 18:32:05,518-0600 INFO unknown Swift v0.3-dev r1456 2007-11-05 18:32:05,520-0600 INFO unknown RunID: 20071105-1831-d7t5l2n3 2007-11-05 18:32:06,701-0600 INFO AbstractDataNode Found data pcapfiles.$[]/1.[0] 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data pcapfiles.$[]/2.[1] 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data pcapfiles.$[]/3.[2] UC64$ From wilde at mcs.anl.gov Mon Nov 5 18:56:45 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 18:56:45 -0600 Subject: [Swift-devel] slow swift startup time In-Reply-To: <472FB948.8010605@mcs.anl.gov> References: <472FB948.8010605@mcs.anl.gov> Message-ID: <472FBBCD.3020908@mcs.anl.gov> The lock error I was referring to is this, on stdout/err: Failed to acquire exclusive lock on log file. Below is the log file text that accompanied it. Note that Im not complaining about this error - it went away when I started compiling on terminable again. Im just pointing it out as a suspect in the slow startup. It surprised me that we bother to lock the logfile, unless Java is gratuituously doig it for us. - Mike Logfile head showed: 2007-11-04 10:15:28,949-0600 INFO Loader awf3.swift: source file is new. Recompiling. 
2007-11-04 10:15:31,315-0600 INFO Karajan Validation of XML intermediate file was successful 2007-11-04 10:15:34,597-0600 INFO unknown Using sites file: ./sites.xml 2007-11-04 10:15:34,599-0600 INFO unknown Using tc.data: ./tc.data 2007-11-04 10:15:36,606-0600 INFO unknown Swift v0.3-dev libexec/svn-revision: line 1: svn: command not found libexec/svn-revision: line 1: svn: command not found 2007-11-04 10:15:36,607-0600 INFO unknown RunID: 20071104-1015-afgc18i3 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data pcapfiles.$[]/1.[0] 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data pcapfiles.$[]/2.[1] 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data pcapfiles.$[]/3.[2] 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data pcapfiles.$[]/4.[3] 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data pcapfiles.$[]/5.[4] 2007-11-04 10:16:26,917-0600 INFO FlushableLockedFileWriter Could not acquire lock on /home/wilde/angle/data/./awf3-20071104-1015 -afgc18i3.0.rlog java.io.IOException: No locks available at sun.nio.ch.FileChannelImpl.lock0(Native Method) at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:804) at java.nio.channels.FileChannel.tryLock(FileChannel.java:983) at org.globus.cog.karajan.workflow.nodes.restartLog.FlushableLockedFileWriter.(FlushableLockedFileWriter.java:26) at org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.create(RestartLog.java:123) at org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.partialArgumentsEvaluated(RestartLog.java:55) at org.globus.cog.karajan.workflow.nodes.PartialArgumentsContainer.executeChildren(PartialArgumentsContainer.java:51) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled Code)) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled Code)) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled Code)) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined Compiled Code)) at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled Code)) 2007-11-04 10:16:26,921-0600 WARN RestartLog Failed to acquire exclusive lock on log file. 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-4 tr=angle4 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-1 tr=angle4 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-5 tr=angle4 2007-11-04 10:16:27,000-0600 INFO vdl:execute START thread=0-3 tr=angle4 2007-11-04 10:16:27,001-0600 INFO vdl:execute START thread=0-2 tr=angle4 On 11/5/07 6:46 PM, Michael Wilde wrote: > Im running on tg-login.uc.teragrid.org (IA64) and seeing very long swift > start times. > > My swift command wrapper prints the wf start and end times with the > swift stdout sandwiched in between. Here's an example of those, > followed by the swift log file. In this run, i start swift in the > background, then tail the stdout file. it was about 70 seconds (on my > watch) before swift responded with its initial messages on stdout. (I > dont think its being buffered, but thats worth checking...) 
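One quick way to see where those 70-odd seconds go is to diff the timestamps the log already carries. A throwaway helper along these lines (hypothetical, not part of Swift) prints any gap longer than a second together with the line that follows it:

// Print gaps between consecutive log4j-style timestamps such as
// "2007-11-05 18:31:09,566-0600 INFO ...", to show which startup phase is slow.
import java.io.BufferedReader;
import java.io.FileReader;
import java.text.SimpleDateFormat;
import java.util.Date;

public class LogGaps {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        Date previous = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.length() < 23) continue;          // too short to hold a timestamp
            Date current;
            try {
                current = fmt.parse(line.substring(0, 23));
            } catch (java.text.ParseException e) {
                continue;                              // continuation line, no timestamp
            }
            if (previous != null) {
                long gap = current.getTime() - previous.getTime();
                if (gap > 1000) System.out.println((gap / 1000.0) + "s before: " + line);
            }
            previous = current;
        }
        in.close();
    }
}

Against the awf*.log head quoted in this thread it would flag roughly 21 seconds before the XML validation line and 25 seconds before the sites-file line; the time between launching the JVM and the first log line it cannot see, of course.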
> > Note that swift was launched at 18:30:49 and its logfile entry with the > runid came at 18:32:05. 32:05-30:49 = 76 seconds! > > This was swift 1456 compiled on terminable (or login, i forget). > > Suspicious: when I was running a version compiled in tg-login under Java > 1.4 I would get an error message from a Java method trying to lock the > log file. Not sure if this logging action (which now does not give a > message) is related to this slow start time. > > - Mike > > UC64$ cat swift.out > Swift Script local-noks.swift starting at Mon Nov 5 18:30:49 CST 2007 > > Swift v0.3-dev r1456 > > RunID: 20071105-1831-d7t5l2n3 > angle4 started > angle4 started > angle4 started > angle4 started > angle4 started > angle4 completed > angle4 completed > angle4 completed > angle4 completed > angle4 completed > > Swift Script local-noks.swift ended at Mon Nov 5 18:32:34 CST 2007 with > exit code 0 > > > UC64$ head awf*.log > 2007-11-05 18:31:09,566-0600 INFO Loader awf6.swift: source file is > new. Recompiling. > 2007-11-05 18:31:30,454-0600 INFO Karajan Validation of XML > intermediate file was successful > 2007-11-05 18:31:55,465-0600 INFO unknown Using sites file: ./sites.xml > 2007-11-05 18:31:55,466-0600 INFO unknown Using tc.data: ./tc.data > 2007-11-05 18:32:05,518-0600 INFO unknown Swift v0.3-dev r1456 > > 2007-11-05 18:32:05,520-0600 INFO unknown RunID: 20071105-1831-d7t5l2n3 > 2007-11-05 18:32:06,701-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/1.[0] > 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/2.[1] > 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/3.[2] > UC64$ > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From wilde at mcs.anl.gov Mon Nov 5 19:19:26 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 19:19:26 -0600 Subject: [Swift-devel] filesys_mapper doesnt take structured filenames Message-ID: <472FC11E.2090300@mcs.anl.gov> Ben, did you say that this mapper invocation *should* take directories? It doesnt seem to: pcapfile pcapfiles[]; The full code is below. The program exits without finding anything. The dir input/ is in my cwd when running swift and contains pc1.pcap thru pc5.pcap. - Mike UC64$ cat awf6.swift type pcapfile; type angleout; type anglecenter; (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) { app { angle4 @ifile @ofile @cfile; } } pcapfile pcapfiles[]; angleout of[] ; anglecenter cf[] ; foreach pf,i in pcapfiles { (of[i],cf[i]) = angle4(pf); } UC64$ -- UC64$ pwd /home/wilde/angle/data UC64$ ls input pc1.pcap pc2.pcap pc3.pcap pc4.pcap pc5.pcap UC64$ From hategan at mcs.anl.gov Mon Nov 5 20:45:42 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 05 Nov 2007 20:45:42 -0600 Subject: [Swift-devel] slow swift startup time In-Reply-To: <472FBBCD.3020908@mcs.anl.gov> References: <472FB948.8010605@mcs.anl.gov> <472FBBCD.3020908@mcs.anl.gov> Message-ID: <1194317143.8333.3.camel@blabla.mcs.anl.gov> It's a warning. warning != error. On Mon, 2007-11-05 at 18:56 -0600, Michael Wilde wrote: > The lock error I was referring to is this, on stdout/err: > > Failed to acquire exclusive lock on log file. > > Below is the log file text that accompanied it. > > Note that Im not complaining about this error - it went away when I > started compiling on terminable again. > > Im just pointing it out as a suspect in the slow startup. 
It surprised > me that we bother to lock the logfile, unless Java is gratuituously doig > it for us. > > - Mike > > Logfile head showed: > > 2007-11-04 10:15:28,949-0600 INFO Loader awf3.swift: source file is > new. Recompiling. > 2007-11-04 10:15:31,315-0600 INFO Karajan Validation of XML > intermediate file was successful > 2007-11-04 10:15:34,597-0600 INFO unknown Using sites file: ./sites.xml > 2007-11-04 10:15:34,599-0600 INFO unknown Using tc.data: ./tc.data > 2007-11-04 10:15:36,606-0600 INFO unknown Swift v0.3-dev > libexec/svn-revision: line 1: svn: command not found > libexec/svn-revision: line 1: svn: command not found > > > 2007-11-04 10:15:36,607-0600 INFO unknown RunID: 20071104-1015-afgc18i3 > 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/1.[0] > 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/2.[1] > 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/3.[2] > 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/4.[3] > 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/5.[4] > 2007-11-04 10:16:26,917-0600 INFO FlushableLockedFileWriter Could not > acquire lock on /home/wilde/angle/data/./awf3-20071104-1015 > -afgc18i3.0.rlog > java.io.IOException: No locks available > at sun.nio.ch.FileChannelImpl.lock0(Native Method) > at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:804) > at java.nio.channels.FileChannel.tryLock(FileChannel.java:983) > at > org.globus.cog.karajan.workflow.nodes.restartLog.FlushableLockedFileWriter.(FlushableLockedFileWriter.java:26) > at > org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.create(RestartLog.java:123) > at > org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.partialArgumentsEvaluated(RestartLog.java:55) > at > org.globus.cog.karajan.workflow.nodes.PartialArgumentsContainer.executeChildren(PartialArgumentsContainer.java:51) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined > Compiled Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined > Compiled Code)) > at > org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled > Code)) > 2007-11-04 10:16:26,921-0600 WARN RestartLog Failed to acquire > exclusive lock on log file. > 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-4 tr=angle4 > 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-1 tr=angle4 > 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-5 tr=angle4 > 2007-11-04 10:16:27,000-0600 INFO vdl:execute START thread=0-3 tr=angle4 > 2007-11-04 10:16:27,001-0600 INFO vdl:execute START thread=0-2 tr=angle4 > > > > On 11/5/07 6:46 PM, Michael Wilde wrote: > > Im running on tg-login.uc.teragrid.org (IA64) and seeing very long swift > > start times. 
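For reference, the lock acquisition in the FlushableLockedFileWriter stack trace above boils down to a single non-blocking tryLock() on the restart log, roughly as in the sketch below (my own illustration, with an invented file name, not the Swift code). tryLock() does not retry or sleep, so by itself it should not account for a long stall, but on a network filesystem the underlying lock request still goes to the file server, and on mounts without lock support it fails with the "No locks available" IOException seen in the trace:

// Illustration of an advisory lock attempt on a restart log (not Swift code).
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class RestartLogLockDemo {
    public static void main(String[] args) throws IOException {
        File logFile = new File(args.length > 0 ? args[0] : "demo.rlog");
        RandomAccessFile raf = new RandomAccessFile(logFile, "rw");
        FileChannel channel = raf.getChannel();
        FileLock lock = null;
        try {
            // returns a FileLock, or null if another process already holds one,
            // or throws IOException if the filesystem cannot do locking at all
            lock = channel.tryLock();
            if (lock == null) System.err.println("Warning: log already locked elsewhere, continuing without lock");
        } catch (IOException e) {
            // the case in the quoted log: writing proceeds, only exclusivity is lost
            System.err.println("Warning: could not acquire lock: " + e.getMessage());
        }
        raf.writeBytes("restart log entry\n");
        if (lock != null) lock.release();
        raf.close();
    }
}

Timing that call directly (or, as suggested below, running from a local filesystem) would settle whether the lock attempt contributes anything to the startup gap.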
> > > > My swift command wrapper prints the wf start and end times with the > > swift stdout sandwiched in between. Here's an example of those, > > followed by the swift log file. In this run, i start swift in the > > background, then tail the stdout file. it was about 70 seconds (on my > > watch) before swift responded with its initial messages on stdout. (I > > dont think its being buffered, but thats worth checking...) > > > > Note that swift was launched at 18:30:49 and its logfile entry with the > > runid came at 18:32:05. 32:05-30:49 = 76 seconds! > > > > This was swift 1456 compiled on terminable (or login, i forget). > > > > Suspicious: when I was running a version compiled in tg-login under Java > > 1.4 I would get an error message from a Java method trying to lock the > > log file. Not sure if this logging action (which now does not give a > > message) is related to this slow start time. > > > > - Mike > > > > UC64$ cat swift.out > > Swift Script local-noks.swift starting at Mon Nov 5 18:30:49 CST 2007 > > > > Swift v0.3-dev r1456 > > > > RunID: 20071105-1831-d7t5l2n3 > > angle4 started > > angle4 started > > angle4 started > > angle4 started > > angle4 started > > angle4 completed > > angle4 completed > > angle4 completed > > angle4 completed > > angle4 completed > > > > Swift Script local-noks.swift ended at Mon Nov 5 18:32:34 CST 2007 with > > exit code 0 > > > > > > UC64$ head awf*.log > > 2007-11-05 18:31:09,566-0600 INFO Loader awf6.swift: source file is > > new. Recompiling. > > 2007-11-05 18:31:30,454-0600 INFO Karajan Validation of XML > > intermediate file was successful > > 2007-11-05 18:31:55,465-0600 INFO unknown Using sites file: ./sites.xml > > 2007-11-05 18:31:55,466-0600 INFO unknown Using tc.data: ./tc.data > > 2007-11-05 18:32:05,518-0600 INFO unknown Swift v0.3-dev r1456 > > > > 2007-11-05 18:32:05,520-0600 INFO unknown RunID: 20071105-1831-d7t5l2n3 > > 2007-11-05 18:32:06,701-0600 INFO AbstractDataNode Found data > > pcapfiles.$[]/1.[0] > > 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data > > pcapfiles.$[]/2.[1] > > 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data > > pcapfiles.$[]/3.[2] > > UC64$ > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Nov 5 20:45:29 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 6 Nov 2007 02:45:29 +0000 (GMT) Subject: [Swift-devel] filesys_mapper doesnt take structured filenames In-Reply-To: <472FC11E.2090300@mcs.anl.gov> References: <472FC11E.2090300@mcs.anl.gov> Message-ID: On Mon, 5 Nov 2007, Michael Wilde wrote: > Ben, did you say that this mapper invocation *should* take directories? no. -- From wilde at mcs.anl.gov Mon Nov 5 21:43:51 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 21:43:51 -0600 Subject: [Swift-devel] Re: high load on tg-grid1 In-Reply-To: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> References: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> Message-ID: <472FE2F7.10407@mcs.anl.gov> Joe, I started a workflow with 1000 jobs - most likely thats what caused this. I need to check the throttles on this workflow - its possible they were open too wide. 
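Since the throttles keep coming up in this thread: the essential idea behind the job-submission throttle is just bounded concurrency, along the lines of the sketch below. This is only an illustration of the concept, not Swift's scheduler code; with the throttle off, every ready job presumably gets its own GRAM submission at once, which is the kind of thing that drives the gatekeeper load up:

// Illustration of a bounded-concurrency submitter (not Swift's actual throttle).
import java.util.concurrent.Semaphore;

public class ThrottledSubmitter {
    private final Semaphore slots;

    public ThrottledSubmitter(int maxSimultaneousJobs) {
        this.slots = new Semaphore(maxSimultaneousJobs);
    }

    // blocks until a slot is free, then runs the submission in its own thread
    public void submit(final Runnable gramSubmission) throws InterruptedException {
        slots.acquire();
        new Thread(new Runnable() {
            public void run() {
                try {
                    gramSubmission.run();   // stand-in for the GRAM submit plus job lifetime
                } finally {
                    slots.release();        // freeing the slot lets the next job go
                }
            }
        }).start();
    }

    public static void main(String[] args) throws InterruptedException {
        ThrottledSubmitter submitter = new ThrottledSubmitter(4);  // at most 4 in flight
        for (int i = 0; i < 20; i++) {
            final int job = i;
            submitter.submit(new Runnable() {
                public void run() {
                    System.out.println("job " + job + " running");
                    try { Thread.sleep(500); } catch (InterruptedException e) { }
                }
            });
        }
    }
}

With no limit the loop would start all twenty at once; with a limit of four it never has more than four in flight, which is roughly the behaviour the default throttle settings are meant to give.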
Another possibility - not sure if this was cause or effect - was that I got hundreds of messages from PBS (job aborted messages) of the form that I reported to help at tg yesterday. Im about to investigate the logs, but all my jobs are out of the queue now, and the workflow has completed. (Ben: I'll be filing the log momentarily after I do an initial check of it. Of 1000 jobs I got about 533 result datasets returned. This was w/o clustering). I got 396 emails from PBS. - Mike (Ti: responding to tg-support as thats where Joe sent this...) On 11/5/07 9:15 PM, joseph insley wrote: > I'm not sure what was causing this, but the load on tg-grid1 spiked at > over 200 a short while ago. It's coming back down now, but while it was > high I tried to submit a job through GRAM (pre-WS) and after a long wait > I got the error "GRAM Job submission failed because an I/O operation > failed (error code 3)" > > At the time there were a number of globus-job-manager processes > belonging to Mike Wilde, but only on the order of ~30something.. it > doesn't seem like this should cause such a high load, so I don't know > what was up... > > joe. > > =================================================== > joseph a. insley > insley at mcs.anl.gov > mathematics & computer science division (630) 252-5649 > argonne national laboratory (630) 252-5986 > (fax) > > > From wilde at mcs.anl.gov Mon Nov 5 22:01:08 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 22:01:08 -0600 Subject: [Swift-devel] slow swift startup time In-Reply-To: <1194317143.8333.3.camel@blabla.mcs.anl.gov> References: <472FB948.8010605@mcs.anl.gov> <472FBBCD.3020908@mcs.anl.gov> <1194317143.8333.3.camel@blabla.mcs.anl.gov> Message-ID: <472FE704.7080605@mcs.anl.gov> Like I said when I sent it - I wasnt complaining about it as an error; I was wondering if the *fact* that its requesting a lock could be causing a delay. (Eg, trying to lock, sleeping, failing, and then continuing). I was just trying to look for an explanation of why startup is slow. - Mike On 11/5/07 8:45 PM, Mihael Hategan wrote: > It's a warning. warning != error. > > On Mon, 2007-11-05 at 18:56 -0600, Michael Wilde wrote: >> The lock error I was referring to is this, on stdout/err: >> >> Failed to acquire exclusive lock on log file. >> >> Below is the log file text that accompanied it. >> >> Note that Im not complaining about this error - it went away when I >> started compiling on terminable again. >> >> Im just pointing it out as a suspect in the slow startup. It surprised >> me that we bother to lock the logfile, unless Java is gratuituously doig >> it for us. >> >> - Mike >> >> Logfile head showed: >> >> 2007-11-04 10:15:28,949-0600 INFO Loader awf3.swift: source file is >> new. Recompiling. 
>> 2007-11-04 10:15:31,315-0600 INFO Karajan Validation of XML >> intermediate file was successful >> 2007-11-04 10:15:34,597-0600 INFO unknown Using sites file: ./sites.xml >> 2007-11-04 10:15:34,599-0600 INFO unknown Using tc.data: ./tc.data >> 2007-11-04 10:15:36,606-0600 INFO unknown Swift v0.3-dev >> libexec/svn-revision: line 1: svn: command not found >> libexec/svn-revision: line 1: svn: command not found >> >> >> 2007-11-04 10:15:36,607-0600 INFO unknown RunID: 20071104-1015-afgc18i3 >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data >> pcapfiles.$[]/1.[0] >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data >> pcapfiles.$[]/2.[1] >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data >> pcapfiles.$[]/3.[2] >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data >> pcapfiles.$[]/4.[3] >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data >> pcapfiles.$[]/5.[4] >> 2007-11-04 10:16:26,917-0600 INFO FlushableLockedFileWriter Could not >> acquire lock on /home/wilde/angle/data/./awf3-20071104-1015 >> -afgc18i3.0.rlog >> java.io.IOException: No locks available >> at sun.nio.ch.FileChannelImpl.lock0(Native Method) >> at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:804) >> at java.nio.channels.FileChannel.tryLock(FileChannel.java:983) >> at >> org.globus.cog.karajan.workflow.nodes.restartLog.FlushableLockedFileWriter.(FlushableLockedFileWriter.java:26) >> at >> org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.create(RestartLog.java:123) >> at >> org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.partialArgumentsEvaluated(RestartLog.java:55) >> at >> org.globus.cog.karajan.workflow.nodes.PartialArgumentsContainer.executeChildren(PartialArgumentsContainer.java:51) >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined >> Compiled Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined >> Compiled Code)) >> at >> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled >> Code)) >> 2007-11-04 10:16:26,921-0600 WARN RestartLog Failed to acquire >> exclusive lock on log file. >> 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-4 tr=angle4 >> 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-1 tr=angle4 >> 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-5 tr=angle4 >> 2007-11-04 10:16:27,000-0600 INFO vdl:execute START thread=0-3 tr=angle4 >> 2007-11-04 10:16:27,001-0600 INFO vdl:execute START thread=0-2 tr=angle4 >> >> >> >> On 11/5/07 6:46 PM, Michael Wilde wrote: >>> Im running on tg-login.uc.teragrid.org (IA64) and seeing very long swift >>> start times. >>> >>> My swift command wrapper prints the wf start and end times with the >>> swift stdout sandwiched in between. Here's an example of those, >>> followed by the swift log file. 
In this run, i start swift in the >>> background, then tail the stdout file. it was about 70 seconds (on my >>> watch) before swift responded with its initial messages on stdout. (I >>> dont think its being buffered, but thats worth checking...) >>> >>> Note that swift was launched at 18:30:49 and its logfile entry with the >>> runid came at 18:32:05. 32:05-30:49 = 76 seconds! >>> >>> This was swift 1456 compiled on terminable (or login, i forget). >>> >>> Suspicious: when I was running a version compiled in tg-login under Java >>> 1.4 I would get an error message from a Java method trying to lock the >>> log file. Not sure if this logging action (which now does not give a >>> message) is related to this slow start time. >>> >>> - Mike >>> >>> UC64$ cat swift.out >>> Swift Script local-noks.swift starting at Mon Nov 5 18:30:49 CST 2007 >>> >>> Swift v0.3-dev r1456 >>> >>> RunID: 20071105-1831-d7t5l2n3 >>> angle4 started >>> angle4 started >>> angle4 started >>> angle4 started >>> angle4 started >>> angle4 completed >>> angle4 completed >>> angle4 completed >>> angle4 completed >>> angle4 completed >>> >>> Swift Script local-noks.swift ended at Mon Nov 5 18:32:34 CST 2007 with >>> exit code 0 >>> >>> >>> UC64$ head awf*.log >>> 2007-11-05 18:31:09,566-0600 INFO Loader awf6.swift: source file is >>> new. Recompiling. >>> 2007-11-05 18:31:30,454-0600 INFO Karajan Validation of XML >>> intermediate file was successful >>> 2007-11-05 18:31:55,465-0600 INFO unknown Using sites file: ./sites.xml >>> 2007-11-05 18:31:55,466-0600 INFO unknown Using tc.data: ./tc.data >>> 2007-11-05 18:32:05,518-0600 INFO unknown Swift v0.3-dev r1456 >>> >>> 2007-11-05 18:32:05,520-0600 INFO unknown RunID: 20071105-1831-d7t5l2n3 >>> 2007-11-05 18:32:06,701-0600 INFO AbstractDataNode Found data >>> pcapfiles.$[]/1.[0] >>> 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data >>> pcapfiles.$[]/2.[1] >>> 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data >>> pcapfiles.$[]/3.[2] >>> UC64$ >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > From hategan at mcs.anl.gov Mon Nov 5 22:03:44 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 05 Nov 2007 22:03:44 -0600 Subject: [Swift-devel] slow swift startup time In-Reply-To: <472FE704.7080605@mcs.anl.gov> References: <472FB948.8010605@mcs.anl.gov> <472FBBCD.3020908@mcs.anl.gov> <1194317143.8333.3.camel@blabla.mcs.anl.gov> <472FE704.7080605@mcs.anl.gov> Message-ID: <1194321824.11190.1.camel@blabla.mcs.anl.gov> On Mon, 2007-11-05 at 22:01 -0600, Michael Wilde wrote: > Like I said when I sent it - I wasnt complaining about it as an error; I > was wondering if the *fact* that its requesting a lock could be causing > a delay. (Eg, trying to lock, sleeping, failing, and then continuing). I guess it depends on the fs. If you want to eliminate the possibility, run it from a local fs. > > I was just trying to look for an explanation of why startup is slow. > > - Mike > > > On 11/5/07 8:45 PM, Mihael Hategan wrote: > > It's a warning. warning != error. > > > > On Mon, 2007-11-05 at 18:56 -0600, Michael Wilde wrote: > >> The lock error I was referring to is this, on stdout/err: > >> > >> Failed to acquire exclusive lock on log file. 
> >> > >> Below is the log file text that accompanied it. > >> > >> Note that Im not complaining about this error - it went away when I > >> started compiling on terminable again. > >> > >> Im just pointing it out as a suspect in the slow startup. It surprised > >> me that we bother to lock the logfile, unless Java is gratuituously doig > >> it for us. > >> > >> - Mike > >> > >> Logfile head showed: > >> > >> 2007-11-04 10:15:28,949-0600 INFO Loader awf3.swift: source file is > >> new. Recompiling. > >> 2007-11-04 10:15:31,315-0600 INFO Karajan Validation of XML > >> intermediate file was successful > >> 2007-11-04 10:15:34,597-0600 INFO unknown Using sites file: ./sites.xml > >> 2007-11-04 10:15:34,599-0600 INFO unknown Using tc.data: ./tc.data > >> 2007-11-04 10:15:36,606-0600 INFO unknown Swift v0.3-dev > >> libexec/svn-revision: line 1: svn: command not found > >> libexec/svn-revision: line 1: svn: command not found > >> > >> > >> 2007-11-04 10:15:36,607-0600 INFO unknown RunID: 20071104-1015-afgc18i3 > >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > >> pcapfiles.$[]/1.[0] > >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > >> pcapfiles.$[]/2.[1] > >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > >> pcapfiles.$[]/3.[2] > >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > >> pcapfiles.$[]/4.[3] > >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > >> pcapfiles.$[]/5.[4] > >> 2007-11-04 10:16:26,917-0600 INFO FlushableLockedFileWriter Could not > >> acquire lock on /home/wilde/angle/data/./awf3-20071104-1015 > >> -afgc18i3.0.rlog > >> java.io.IOException: No locks available > >> at sun.nio.ch.FileChannelImpl.lock0(Native Method) > >> at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:804) > >> at java.nio.channels.FileChannel.tryLock(FileChannel.java:983) > >> at > >> org.globus.cog.karajan.workflow.nodes.restartLog.FlushableLockedFileWriter.(FlushableLockedFileWriter.java:26) > >> at > >> org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.create(RestartLog.java:123) > >> at > >> org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.partialArgumentsEvaluated(RestartLog.java:55) > >> at > >> org.globus.cog.karajan.workflow.nodes.PartialArgumentsContainer.executeChildren(PartialArgumentsContainer.java:51) > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled > >> Code)) > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled > >> Code)) > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined > >> Compiled Code)) > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled > >> Code)) > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled > >> Code)) > >> at > >> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled > >> Code)) > >> at > >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled > >> Code)) > >> at > >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined > >> Compiled Code)) > >> at > >> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled > >> Code)) > >> 2007-11-04 10:16:26,921-0600 WARN RestartLog Failed to acquire > >> exclusive lock on log file. 
> >> 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-4 tr=angle4 > >> 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-1 tr=angle4 > >> 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-5 tr=angle4 > >> 2007-11-04 10:16:27,000-0600 INFO vdl:execute START thread=0-3 tr=angle4 > >> 2007-11-04 10:16:27,001-0600 INFO vdl:execute START thread=0-2 tr=angle4 > >> > >> > >> > >> On 11/5/07 6:46 PM, Michael Wilde wrote: > >>> Im running on tg-login.uc.teragrid.org (IA64) and seeing very long swift > >>> start times. > >>> > >>> My swift command wrapper prints the wf start and end times with the > >>> swift stdout sandwiched in between. Here's an example of those, > >>> followed by the swift log file. In this run, i start swift in the > >>> background, then tail the stdout file. it was about 70 seconds (on my > >>> watch) before swift responded with its initial messages on stdout. (I > >>> dont think its being buffered, but thats worth checking...) > >>> > >>> Note that swift was launched at 18:30:49 and its logfile entry with the > >>> runid came at 18:32:05. 32:05-30:49 = 76 seconds! > >>> > >>> This was swift 1456 compiled on terminable (or login, i forget). > >>> > >>> Suspicious: when I was running a version compiled in tg-login under Java > >>> 1.4 I would get an error message from a Java method trying to lock the > >>> log file. Not sure if this logging action (which now does not give a > >>> message) is related to this slow start time. > >>> > >>> - Mike > >>> > >>> UC64$ cat swift.out > >>> Swift Script local-noks.swift starting at Mon Nov 5 18:30:49 CST 2007 > >>> > >>> Swift v0.3-dev r1456 > >>> > >>> RunID: 20071105-1831-d7t5l2n3 > >>> angle4 started > >>> angle4 started > >>> angle4 started > >>> angle4 started > >>> angle4 started > >>> angle4 completed > >>> angle4 completed > >>> angle4 completed > >>> angle4 completed > >>> angle4 completed > >>> > >>> Swift Script local-noks.swift ended at Mon Nov 5 18:32:34 CST 2007 with > >>> exit code 0 > >>> > >>> > >>> UC64$ head awf*.log > >>> 2007-11-05 18:31:09,566-0600 INFO Loader awf6.swift: source file is > >>> new. Recompiling. 
> >>> 2007-11-05 18:31:30,454-0600 INFO Karajan Validation of XML > >>> intermediate file was successful > >>> 2007-11-05 18:31:55,465-0600 INFO unknown Using sites file: ./sites.xml > >>> 2007-11-05 18:31:55,466-0600 INFO unknown Using tc.data: ./tc.data > >>> 2007-11-05 18:32:05,518-0600 INFO unknown Swift v0.3-dev r1456 > >>> > >>> 2007-11-05 18:32:05,520-0600 INFO unknown RunID: 20071105-1831-d7t5l2n3 > >>> 2007-11-05 18:32:06,701-0600 INFO AbstractDataNode Found data > >>> pcapfiles.$[]/1.[0] > >>> 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data > >>> pcapfiles.$[]/2.[1] > >>> 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data > >>> pcapfiles.$[]/3.[2] > >>> UC64$ > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > >>> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > > > > From wilde at mcs.anl.gov Mon Nov 5 22:27:55 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 22:27:55 -0600 Subject: [Swift-devel] 1000-job angle workflow gets high failure rate In-Reply-To: <472FE2F7.10407@mcs.anl.gov> References: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> <472FE2F7.10407@mcs.anl.gov> Message-ID: <472FED4B.209@mcs.anl.gov> Was: Re: [Swift-devel] Re: high load on tg-grid1 Ben, the logs of my first 1000-job run for this week is in swift-logs/wilde/run153. This run shows a high volume (396) of the same emailed PBS error "Aborted by PBS Server" that I first saw on Saturday night. (although it turns out I now see these sporadically in my email going back to august) It produced 469 kickstart records and 1064 (out of 2000) data files. I assume the data files came in pairs, that would be 532 succeeding jobs. Its odd that 469+532=1001, but perhaps a coincidence. Im not going to take this log apart yet; first I want to rerun with clustering, and check my throttles. Possible that throttles open too wide are causing the PBS failures? Possible the same issue for the 5-wide angle run??? Also: I thought kickstart recs would get returned in a directory tree, no? Lastly, I'd like to get the input files mapped from a tree structure. Can structured_regexp_mapper do that? Ie, can I set its source to a dir rather than a swift variable? (You might have explained that, but I didnt get it in my notes). If the args to this have some powerful variations, can you fire off a note describing? Thanks, Mike On 11/5/07 9:43 PM, Michael Wilde wrote: > Joe, I started a workflow with 1000 jobs - most likely thats what caused > this. I need to check the throttles on this workflow - its possible they > were open too wide. > > Another possibility - not sure if this was cause or effect - was that I > got hundreds of messages from PBS (job aborted messages) of the form > that I reported to help at tg yesterday. > > Im about to investigate the logs, but all my jobs are out of the queue > now, and the workflow has completed. > > (Ben: I'll be filing the log momentarily after I do an initial check of > it. Of 1000 jobs I got about 533 result datasets returned. This was w/o > clustering). I got 396 emails from PBS. > > - Mike > > (Ti: responding to tg-support as thats where Joe sent this...) 
> > On 11/5/07 9:15 PM, joseph insley wrote: >> I'm not sure what was causing this, but the load on tg-grid1 spiked at >> over 200 a short while ago. It's coming back down now, but while it >> was high I tried to submit a job through GRAM (pre-WS) and after a >> long wait I got the error "GRAM Job submission failed because an I/O >> operation failed (error code 3)" >> >> At the time there were a number of globus-job-manager processes >> belonging to Mike Wilde, but only on the order of ~30something.. it >> doesn't seem like this should cause such a high load, so I don't know >> what was up... >> >> joe. >> >> =================================================== >> joseph a. insley >> insley at mcs.anl.gov >> mathematics & computer science division (630) 252-5649 >> argonne national laboratory (630) >> 252-5986 (fax) >> >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Mon Nov 5 22:04:24 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 6 Nov 2007 04:04:24 +0000 (GMT) Subject: [Swift-devel] Re: high load on tg-grid1 In-Reply-To: <472FE2F7.10407@mcs.anl.gov> References: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> <472FE2F7.10407@mcs.anl.gov> Message-ID: Did you run this with swift default throttling? If so, I'm interested to see the swift site scores. On Mon, 5 Nov 2007, Michael Wilde wrote: > Joe, I started a workflow with 1000 jobs - most likely thats what caused this. > I need to check the throttles on this workflow - its possible they were open > too wide. > > Another possibility - not sure if this was cause or effect - was that I got > hundreds of messages from PBS (job aborted messages) of the form that I > reported to help at tg yesterday. > > Im about to investigate the logs, but all my jobs are out of the queue now, > and the workflow has completed. > > (Ben: I'll be filing the log momentarily after I do an initial check of it. Of > 1000 jobs I got about 533 result datasets returned. This was w/o clustering). > I got 396 emails from PBS. > > - Mike > > (Ti: responding to tg-support as thats where Joe sent this...) > > On 11/5/07 9:15 PM, joseph insley wrote: > > I'm not sure what was causing this, but the load on tg-grid1 spiked at over > > 200 a short while ago. It's coming back down now, but while it was high I > > tried to submit a job through GRAM (pre-WS) and after a long wait I got the > > error "GRAM Job submission failed because an I/O operation failed (error > > code 3)" > > > > At the time there were a number of globus-job-manager processes belonging to > > Mike Wilde, but only on the order of ~30something.. it doesn't seem like > > this should cause such a high load, so I don't know what was up... > > > > joe. > > > > =================================================== > > joseph a. insley > > insley at mcs.anl.gov > > mathematics & computer science division (630) 252-5649 > > argonne national laboratory (630) 252-5986 > > (fax) > > > > > > > > From wilde at mcs.anl.gov Mon Nov 5 22:38:37 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 22:38:37 -0600 Subject: [Swift-devel] Re: high load on tg-grid1 In-Reply-To: References: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> <472FE2F7.10407@mcs.anl.gov> Message-ID: <472FEFCD.6070904@mcs.anl.gov> I ran it with the options in the swift.properties file of that log dir run153. 
Ive been using these for a bit, you'll need to check there what the settings were. If you suggest new ones I'll try them now when I set up a clustered run. Any suggestion on clustering size? Also: this, unlike previous runs, is running a dummy (sleep) angle job. What should I set that simulated run time to? The real angle run time is O(60 seconds). Want it "real" or "faster"? - Mike On 11/5/07 10:04 PM, Ben Clifford wrote: > Did you run this with swift default throttling? If so, I'm interested to > see the swift site scores. > > On Mon, 5 Nov 2007, Michael Wilde wrote: > >> Joe, I started a workflow with 1000 jobs - most likely thats what caused this. >> I need to check the throttles on this workflow - its possible they were open >> too wide. >> >> Another possibility - not sure if this was cause or effect - was that I got >> hundreds of messages from PBS (job aborted messages) of the form that I >> reported to help at tg yesterday. >> >> Im about to investigate the logs, but all my jobs are out of the queue now, >> and the workflow has completed. >> >> (Ben: I'll be filing the log momentarily after I do an initial check of it. Of >> 1000 jobs I got about 533 result datasets returned. This was w/o clustering). >> I got 396 emails from PBS. >> >> - Mike >> >> (Ti: responding to tg-support as thats where Joe sent this...) >> >> On 11/5/07 9:15 PM, joseph insley wrote: >>> I'm not sure what was causing this, but the load on tg-grid1 spiked at over >>> 200 a short while ago. It's coming back down now, but while it was high I >>> tried to submit a job through GRAM (pre-WS) and after a long wait I got the >>> error "GRAM Job submission failed because an I/O operation failed (error >>> code 3)" >>> >>> At the time there were a number of globus-job-manager processes belonging to >>> Mike Wilde, but only on the order of ~30something.. it doesn't seem like >>> this should cause such a high load, so I don't know what was up... >>> >>> joe. >>> >>> =================================================== >>> joseph a. insley >>> insley at mcs.anl.gov >>> mathematics & computer science division (630) 252-5649 >>> argonne national laboratory (630) 252-5986 >>> (fax) >>> >>> >>> >> > > From wilde at mcs.anl.gov Mon Nov 5 23:02:53 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 23:02:53 -0600 Subject: [Swift-devel] Re: high load on tg-grid1 In-Reply-To: References: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> <472FE2F7.10407@mcs.anl.gov> Message-ID: <472FF57D.9040803@mcs.anl.gov> The job throttles were all set to off - thats the way I had set them for Falkon and I forgot to change them for PBS. The data throttles were set to the defaults. I'll start the next run with clustering with all throttles set to default, unless you suggest different (in time) Note that my swift.properties is a subset of the full file (I only include ones I plan to mess with). - Mike On 11/5/07 10:04 PM, Ben Clifford wrote: > Did you run this with swift default throttling? If so, I'm interested to > see the swift site scores. > > On Mon, 5 Nov 2007, Michael Wilde wrote: > >> Joe, I started a workflow with 1000 jobs - most likely thats what caused this. >> I need to check the throttles on this workflow - its possible they were open >> too wide. >> >> Another possibility - not sure if this was cause or effect - was that I got >> hundreds of messages from PBS (job aborted messages) of the form that I >> reported to help at tg yesterday. 
>> >> Im about to investigate the logs, but all my jobs are out of the queue now, >> and the workflow has completed. >> >> (Ben: I'll be filing the log momentarily after I do an initial check of it. Of >> 1000 jobs I got about 533 result datasets returned. This was w/o clustering). >> I got 396 emails from PBS. >> >> - Mike >> >> (Ti: responding to tg-support as thats where Joe sent this...) >> >> On 11/5/07 9:15 PM, joseph insley wrote: >>> I'm not sure what was causing this, but the load on tg-grid1 spiked at over >>> 200 a short while ago. It's coming back down now, but while it was high I >>> tried to submit a job through GRAM (pre-WS) and after a long wait I got the >>> error "GRAM Job submission failed because an I/O operation failed (error >>> code 3)" >>> >>> At the time there were a number of globus-job-manager processes belonging to >>> Mike Wilde, but only on the order of ~30something.. it doesn't seem like >>> this should cause such a high load, so I don't know what was up... >>> >>> joe. >>> >>> =================================================== >>> joseph a. insley >>> insley at mcs.anl.gov >>> mathematics & computer science division (630) 252-5649 >>> argonne national laboratory (630) 252-5986 >>> (fax) >>> >>> >>> >> > > From wilde at mcs.anl.gov Mon Nov 5 23:56:57 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 23:56:57 -0600 Subject: [Swift-devel] angle-1000 second run Message-ID: <47300229.1040801@mcs.anl.gov> I just ran a second run of angle-1000, this time with clustering. I thought I had the throttles at default values but missed one. I killed the run after a few hundred data files were produced because it was running too slowly and seemed to have reached a steady state. The logs are in wilde/run154. Here;s what I noted seemed wrong with this run: 1. only 4 jobs max ran at a time (as seen by qstat over many many spot checks) 2. only ONE data file came back before I killed the run - yet hundreds were produced (as seen on the server size). Surely these should have started trickling in by now? 3. The cluster sizes were extremely small about 4 - should have been 10-20 by my calcs. 4. I still got over a dozen PBS job aborted messages -- Im going to start another run and let this one go till it finishes. I'll use totally default throttles and increase my cluster params (but I dont understand why the current values didnt work). One more note: this run is using executable script angle4.fast.sh which has a sleep 3 as its main action. It logs misc stuff to its 2 output files, but otherwise takes the same args as the real angle4.sh. Its running out of ~wilde/angle/data on tg-login1. - Mike From wilde at mcs.anl.gov Tue Nov 6 00:05:37 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 00:05:37 -0600 Subject: [Swift-devel] questions on swift properties Message-ID: <47300431.6050306@mcs.anl.gov> 2 questions: A few days ago I saw a spec for how to set GLOBUS::maxwalltime to values other than minutes. Eg 00:nn for seconds??? But I cant find that spec now. Can someone point me at it? I thought there was a new parameter to set the min wait time between submissions to GT2. I cant find that in either the etc/swift.properties sample or the userguide. Am I missing is or is it not documented? Please describe. Lastly - the wording of the throttle parameters in swift.properties confuses me even after reading them 10+ times. Im confused between max # of jobs that can be running, and the rate at which they can be submitted. 
I think these need to be reworded to clarify some confusion. Thanks. From wilde at mcs.anl.gov Tue Nov 6 00:17:19 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 00:17:19 -0600 Subject: [Swift-devel] angle-1000 3rd run Message-ID: <473006EF.6040408@mcs.anl.gov> Just started another angle-1000. I trimmed my properties file to this: -- kickstart.always.transfer=true clustering.enabled=true clustering.queue.delay=10 clustering.min.time=1200 sitedir.keep=true -- Clusters are still small so far - mostly 4. The runtime of each job is 3 secs. I set the maxwalltime to 1 (which I think is 60 seconds) until I can verify how to set this in seconds. The number of running jobs I see is still extremely low - 3 right now; was 1 and 2 for a while. The cluster is wide open - lots of free cpus, no queue. One improvement in this run: data seems to be flowing back almost from the start, unlike the previous run where almost no data result files had come back by the time I killed the wf. I'll let this one run as far as it goes, and check on it in the morning (it should push itself to swift-logs if/when it finishes). - Mike From wilde at mcs.anl.gov Tue Nov 6 06:31:36 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 06:31:36 -0600 Subject: [Swift-devel] angle-1000 3rd run Message-ID: <47305EA8.7080501@mcs.anl.gov> It stopped after producing ~360 output members because the credential expired. I'll need to check for that in my wrapper script. The logs are in swift-logs/wilde/run155. Restarting it now. From wilde at mcs.anl.gov Tue Nov 6 07:19:40 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 07:19:40 -0600 Subject: [Swift-devel] angle-1000 4th run Message-ID: <473069EC.2010408@mcs.anl.gov> is running, but slowly: Same behavior: only 2-3 jobs running in two spot checks. Cluster size is still low. I must have a math error in my time specs. Will check again. Clusters are starting every 10 secs but much smaller than expected/desired. Again, maxwalltime is 60 secs and swift.properties cluster settings are: clustering.enabled=true clustering.queue.delay=10 clustering.min.time=1200 So I would have expected 20-40 jobs per cluster (2 x (1200/60)) - mike UC64$ head swift.out Swift script awf6.swift starting at Tue Nov 6 06:32:46 CST 2007 running on sites: UC-nfs-gt2-ks Swift v0.3-dev r1456 RunID: 20071106-0632-asuk0my2 angle4 started ... 
UC64$ qstat -u wilde tg-master.uc.teragrid.org: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------- ------ ----- --- ------ ----- - ----- 1574316.tg-master.uc wilde dque STDIN 20590 1 -- -- 00:15 R -- 1574318.tg-master.uc wilde dque STDIN 20666 1 -- -- 00:15 R -- UC64$ date Tue Nov 6 06:58:11 CST 2007 UC64$ grep -i cluster.*size a*.log 2007-11-06 06:34:12,327-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-0-1194352386645 with size 4 2007-11-06 06:34:22,326-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-1-1194352386651 with size 2 2007-11-06 06:35:32,332-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-2-1194352386875 with size 3 2007-11-06 06:35:42,332-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-3-1194352386883 with size 2 2007-11-06 06:35:52,333-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-4-1194352386974 with size 2 2007-11-06 06:36:02,334-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-5-1194352387000 with size 4 2007-11-06 06:36:22,335-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-6-1194352387074 with size 3 2007-11-06 06:36:32,336-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-7-1194352387112 with size 4 2007-11-06 06:36:52,337-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-8-1194352387329 with size 4 2007-11-06 06:37:52,344-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-9-1194352387719 with size 3 2007-11-06 06:38:52,350-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-10-1194352387823 with size 3 2007-11-06 06:39:52,356-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-11-1194352387927 with size 3 2007-11-06 06:40:52,362-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-12-1194352388031 with size 3 2007-11-06 06:41:52,369-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-13-1194352388135 with size 3 2007-11-06 06:42:52,376-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-14-1194352388239 with size 3 2007-11-06 06:43:52,382-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-15-1194352388343 with size 3 2007-11-06 06:45:02,389-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-16-1194352388447 with size 3 2007-11-06 06:46:02,396-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-17-1194352388551 with size 3 2007-11-06 06:47:02,403-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-18-1194352388655 with size 3 2007-11-06 06:48:02,409-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-19-1194352388759 with size 3 2007-11-06 06:49:02,415-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-20-1194352388863 with size 3 2007-11-06 06:50:02,421-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-21-1194352388975 with size 4 2007-11-06 06:51:12,428-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-22-1194352389137 with size 4 2007-11-06 06:51:22,428-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-23-1194352389147 with size 4 2007-11-06 06:52:22,434-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-24-1194352389489 with size 4 2007-11-06 06:52:42,435-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-25-1194352389503 with size 4 2007-11-06 06:52:52,517-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-26-1194352389514 with size 3 2007-11-06 06:53:12,517-0600 INFO VDSAdaptiveScheduler Creating cluster 
urn:cluster-27-1194352389579 with size 4 2007-11-06 06:53:32,519-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-28-1194352389593 with size 4 2007-11-06 06:53:42,520-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-29-1194352389703 with size 4 2007-11-06 06:54:02,521-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-30-1194352389765 with size 4 2007-11-06 06:54:22,523-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-31-1194352389952 with size 4 2007-11-06 06:54:32,523-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-32-1194352390996 with size 2 2007-11-06 06:54:42,523-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-33-1194352391028 with size 2 2007-11-06 06:54:52,524-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-34-1194352391084 with size 2 2007-11-06 06:55:02,524-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-35-1194352391098 with size 4 2007-11-06 06:55:22,630-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-36-1194352391181 with size 2 2007-11-06 06:55:32,633-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-37-1194352391192 with size 3 2007-11-06 06:55:42,635-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-38-1194352391224 with size 2 2007-11-06 06:55:52,635-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-39-1194352391244 with size 2 2007-11-06 06:56:03,050-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-40-1194352391252 with size 2 2007-11-06 06:56:12,956-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-41-1194352391290 with size 4 2007-11-06 06:56:22,956-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-42-1194352391334 with size 2 2007-11-06 06:56:32,957-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-43-1194352391366 with size 2 2007-11-06 06:56:42,958-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-44-1194352391404 with size 4 2007-11-06 06:57:02,960-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-45-1194352391466 with size 4 2007-11-06 06:57:12,960-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-46-1194352391525 with size 3 2007-11-06 06:57:32,961-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-47-1194352391614 with size 4 2007-11-06 06:57:52,964-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-48-1194352391682 with size 3 2007-11-06 06:58:02,980-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-49-1194352391735 with size 3 2007-11-06 06:58:22,982-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-50-1194352391818 with size 4 2007-11-06 06:58:32,982-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-51-1194352391829 with size 3 2007-11-06 06:58:52,982-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-52-1194352391894 with size 4 UC64$ From benc at hawaga.org.uk Tue Nov 6 08:11:12 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 6 Nov 2007 14:11:12 +0000 (GMT) Subject: [Swift-devel] Re: high load on tg-grid1 In-Reply-To: <472FF57D.9040803@mcs.anl.gov> References: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> <472FE2F7.10407@mcs.anl.gov> <472FF57D.9040803@mcs.anl.gov> Message-ID: On Mon, 5 Nov 2007, Michael Wilde wrote: > The job throttles were all set to off - thats the way I had set them for > Falkon and I forgot to change them for PBS. Oh dear. 
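On the cluster sizes in the VDSAdaptiveScheduler log above: a back-of-envelope check, assuming clustering.min.time is a per-cluster time budget in seconds and each job is charged its 60-second maxwalltime, suggests why they come out so small. This is my own arithmetic, not a statement about what the clustering code actually does:

// Back-of-envelope check of expected cluster size from the settings quoted in this thread.
public class ClusterSizeEstimate {
    public static void main(String[] args) {
        int clusterMinTime = 1200;        // clustering.min.time, assumed to be seconds
        int jobMaxWalltime = 60;          // maxwalltime charged per job, in seconds
        int queueDelay = 10;              // clustering.queue.delay, seconds between flushes

        // what the time budget alone would allow per cluster
        System.out.println("time-budget size ~ " + (clusterMinTime / jobMaxWalltime));

        // but a cluster can only contain the jobs sitting in the clustering queue
        // when the delay expires; the arrival rate here is an invented figure
        double readyJobsPerSecond = 0.5;
        System.out.println("queue-window size ~ " + (int) (readyJobsPerSecond * queueDelay));
    }
}

If that reading is right, the 10-second queue delay rather than the time budget is what caps the clusters at two to four jobs, which is consistent with the advice later in the thread to raise the delay so that more jobs can accumulate before a cluster is formed.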
> I'll start the next run with clustering with all throttles set to default,
> unless you suggest different (in time)

Run with the default values that are specified in the latest SVN swift.properties file.

--

From benc at hawaga.org.uk Tue Nov 6 08:16:11 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 6 Nov 2007 14:16:11 +0000 (GMT)
Subject: [Swift-devel] Re: angle-1000 second run
In-Reply-To: <47300229.1040801@mcs.anl.gov>
References: <47300229.1040801@mcs.anl.gov>
Message-ID: 

On Mon, 5 Nov 2007, Michael Wilde wrote:

> 1. only 4 jobs max ran at a time (as seen by qstat over many many spot checks)

We can look at scoring from the run.

> 2. only ONE data file came back before I killed the run - yet hundreds were
> produced (as seen on the server size). Surely these should have started
> trickling in by now?

Not if jobs were still staging in - there's one file transfer throttle shared between all file transfers, and stageins submitted at the start are going to get serviced before stage outs. That should be apparent from a graph if I plot it.

> 3. The cluster sizes were extremely small about 4 - should have been 10-20 by
> my calcs.

Increase the cluster queue delay parameter from 4 to about 30 (seconds). This will make Swift wait much longer before putting clusters together, which may allow more jobs to build up in the clustering queue.

Make sure that you have the cluster maximum time and maxwalltimes for jobs set to sensible values, because large clusters will highlight misconfigurations there. In particular, note that the maximum cluster time in the config file needs to be (less than) half of the maxwalltime permitted for the site you submit to (so if you are allowed to run 15 minute jobs, set the cluster maximum time to 7*60, for example).

Are you using the PBS provider or GRAM to submit?

>
> 4. I still got over a dozen PBS job aborted messages
>
> --
>
> Im going to start another run and let this one go till it finishes.
>
> I'll use totally default throttles and increase my cluster params (but I dont
> understand why the current values didnt work).
>
> One more note: this run is using executable script angle4.fast.sh which has a
> sleep 3 as its main action. It logs misc stuff to its 2 output files, but
> otherwise takes the same args as the real angle4.sh.
>
> Its running out of ~wilde/angle/data on tg-login1.
>
> - Mike
>
>
>

From benc at hawaga.org.uk Tue Nov 6 08:32:22 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 6 Nov 2007 14:32:22 +0000 (GMT)
Subject: [Swift-devel] questions on swift properties
In-Reply-To: <47300431.6050306@mcs.anl.gov>
References: <47300431.6050306@mcs.anl.gov>
Message-ID: 

On Tue, 6 Nov 2007, Michael Wilde wrote:

> A few days ago I saw a spec for how to set GLOBUS::maxwalltime to values other
> than minutes. Eg 00:nn for seconds??? But I cant find that spec now. Can
> someone point me at it?

In the user guide, in the properties section in the globus subsection, it should be there.

> I thought there was a new parameter to set the min wait time between
> submissions to GT2. I cant find that in either the etc/swift.properties
> sample or the userguide. Am I missing is or is it not documented? Please
> describe.

can't remember.

> Lastly - the wording of the throttle parameters in swift.properties confuses
> me even after reading them 10+ times. Im confused between max # of jobs that
> can be running, and the rate at which they can be submitted. I think these
> need to be reworded to clarify some confusion.

yes. That's on my to-do list.
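As a concrete illustration of the clustering advice in this thread, a swift.properties fragment might look like the following (values are examples only for a site that allows 15-minute jobs, not the shipped defaults; clustering.min.time stands in here for the "cluster maximum time" knob being discussed):

# illustrative values only
clustering.enabled=true
# seconds to let jobs accumulate before a cluster is formed
clustering.queue.delay=30
# cluster time budget in seconds; keep it under half of the walltime the
# site permits (15 minutes allowed -> 7*60 = 420)
clustering.min.time=420
# maximum number of file transfers in flight at once (a count, not a rate)
throttle.transfers=64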
For now, use the defaults and we can see what happens. -- From wilde at mcs.anl.gov Tue Nov 6 10:19:14 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 10:19:14 -0600 Subject: [Swift-devel] Re: angle-1000 second run In-Reply-To: References: <47300229.1040801@mcs.anl.gov> Message-ID: <47309402.9000606@mcs.anl.gov> >> 3. The cluster sizes were extremely small about 4 - should have been 10-20 by >> my calcs. > > Increase the cluster queue delay parameter from 4 to about 30 (seconds). > This will make Swift wait much longer before putting clusters together, > which may allow more jobs to build up in the clustering queue. Previous run had this set to 10 seconds. The logs confirm that this was the clustering period: the cluster size=4 message came out every 10 seconds. > Make sure that you havethe cluster maximum time and maxwalltimes for jobs > set to sensible values, because large clusters will highlight > misconfigurations there. In particular, note that the maximum cluster time > in the config file needs to be (less than) half of the maxwalltime > permitted for the site you submit to (so if you are allowewd to run 15 > minute jobs, set the cluster maximum time to 7*60, for example). I set cluster max time to 1200 with a maxwalltime of 60 seconds. I will fiddle with this part with smaller runs till it works. Likely I have a config issue somewhere, or theres a bug. > Are you using the PBS provider or GRAM to submit? GRAM, gt2. From wilde at mcs.anl.gov Tue Nov 6 10:28:36 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 10:28:36 -0600 Subject: [Swift-devel] Re: Jobs being aborted by PBS server on tg-grid.uc.teragrid.org In-Reply-To: <200711061619.lA6GJZmt028890@rimantadine.ncsa.uiuc.edu> References: <200711061619.lA6GJZmt028890@rimantadine.ncsa.uiuc.edu> Message-ID: <47309634.6090305@mcs.anl.gov> Excellent, thanks Ti. This explains many of our problems, I think. - Mike On 11/6/07 10:19 AM, help at teragrid.org wrote: > FROM: Leggett, Ti > (Concerning ticket No. 147814) > > I think I fixed this this morning. In all the cases you were given a node in which tg-grid1 could not > communicate with. If you still see this, immediately run: > > checkjob > > if you can and send the output. If you can't, send me the job ID. > > Michael Wilde writes: >> The errors below are from workflows of only 5 jobs. >> One job of the five failed in each of these 3 incidents. >> The failing job was then in each case retried twice more (automatically >> by Swift) >> >> GRAM was not failing to my knowledge during these times. >> >> Do the PBS logs indicate anything? >> >> - Mike >> >> >> On 11/6/07 9:52 AM, help at teragrid.org wrote: >>> FROM: Leggett, Ti >>> (Concerning ticket No. 147814) >>> >>> Are you getting these when you're submitting many (thousands) of jobs and does it coincide with > the >>> gatekeeper becoming unavailable? >>> >>> Michael Wilde writes: >>>> Im starting to see more frequent problems like this. >>>> Happened once last night to 3 consecutive jobs, and tonight happened >>>> twice, to 6 jobs. >>>> >>>> Ti, could you look in the PBS logs, possibly on the related node(s) and >>>> see if its looking like a problem on tg-uc or on our side? >>>> >>>> Thanks, >>>> >>>> Mike >>>> >>>> >>>> 11/3 8:05 PM - 3 failures >>>> Job IDs 1571647, 48, & 49 >>>> 11/4 7:46 PM - 3 failures >>>> Job IDs 1572031, 33, & 34 >>>> 11/4 8:56 - 8:57 PM >>>> 1572040, 42, 43 >>>> >>>> All errors have the format below. 
>>>> >>>> Swift retries failing jobs 3 times, hence the groups of 3 above. >>>> >>>> >>>> -------- Original Message -------- >>>> Subject: PBS JOB 1572043.tg-master.uc.teragrid.org >>>> Date: Sun, 4 Nov 2007 20:57:11 -0600 (CST) >>>> From: adm at tg-master.uc.teragrid.org (root) >>>> To: wilde at tg-grid1.uc.teragrid.org >>>> >>>> PBS Job Id: 1572043.tg-master.uc.teragrid.org >>>> Job Name: STDIN >>>> Aborted by PBS Server >>>> Job cannot be executed >>>> See Administrator for help >>> > > From benc at hawaga.org.uk Tue Nov 6 10:11:03 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 6 Nov 2007 16:11:03 +0000 (GMT) Subject: [Swift-devel] minimum rate limit patch for karajan Message-ID: Below is a patch that puts a lower bound on the site scoring in Karajan. This reduces catastrophic problems caused when a large number of jobs fail at once, pushing the site score so low that it never (during the workflow run) recovers. I think this would be useful for Mike in his angle workflows. Also at: http://www.ci.uchicago.edu/~benc/andrew-ratelimit-minimum Index: cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHost.java =================================================================== --- cog.orig/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHost.java 2007-07-13 11:16:11.000000000 +0100 +++ cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHost.java 2007-10-29 21:30:43.000000000 +0000 @@ -38,6 +38,8 @@ } protected void setScore(double score) { + final int MINWEIGHT = -10; + if(score References: <47300229.1040801@mcs.anl.gov> <47309402.9000606@mcs.anl.gov> Message-ID: <4730A362.7020406@mcs.anl.gov> It seems that the cluster problem is also due to the slow speed of input data file stage-in. It took 6 minutes to stage in 60 40MB input files to uc-tg (this is to NFS; I will try GPFS as well). So at 10 files per minute, if we check the cluster queue every 30 seconds, that about 5 jobs per cluster on average, which explains what we're seeing. 10 fpm = 400MB/min = 6.5MB/sec. Note that Im submitting from the login node to the same cluster - seems very slow. I will test further and try to calibrate the expected speeds on a big file. - Mike On 11/6/07 10:19 AM, Michael Wilde wrote: > >>> 3. The cluster sizes were extremely small about 4 - should have been >>> 10-20 by >>> my calcs. >> >> Increase the cluster queue delay parameter from 4 to about 30 >> (seconds). This will make Swift wait much longer before putting >> clusters together, which may allow more jobs to build up in the >> clustering queue. > > Previous run had this set to 10 seconds. The logs confirm that this was > the clustering period: the cluster size=4 message came out every 10 > seconds. > >> Make sure that you havethe cluster maximum time and maxwalltimes for >> jobs set to sensible values, because large clusters will highlight >> misconfigurations there. In particular, note that the maximum cluster >> time in the config file needs to be (less than) half of the >> maxwalltime permitted for the site you submit to (so if you are >> allowewd to run 15 minute jobs, set the cluster maximum time to 7*60, >> for example). > > I set cluster max time to 1200 with a maxwalltime of 60 seconds. > > I will fiddle with this part with smaller runs till it works. > > Likely I have a config issue somewhere, or theres a bug. > >> Are you using the PBS provider or GRAM to submit? > > GRAM, gt2. 
> From hategan at mcs.anl.gov Tue Nov 6 11:32:05 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 06 Nov 2007 11:32:05 -0600 Subject: [Swift-devel] Re: angle-1000 second run In-Reply-To: <4730A362.7020406@mcs.anl.gov> References: <47300229.1040801@mcs.anl.gov> <47309402.9000606@mcs.anl.gov> <4730A362.7020406@mcs.anl.gov> Message-ID: <1194370326.16107.4.camel@blabla.mcs.anl.gov> On Tue, 2007-11-06 at 11:24 -0600, Michael Wilde wrote: > It seems that the cluster problem is also due to the slow speed of input > data file stage-in. Sounds likely. > > It took 6 minutes to stage in 60 40MB input files to uc-tg > (this is to NFS; I will try GPFS as well). > > So at 10 files per minute, if we check the cluster queue every 30 > seconds, that about 5 jobs per cluster on average, which explains what > we're seeing. > > 10 fpm = 400MB/min = 6.5MB/sec. Note that Im submitting from the login > node to the same cluster - seems very slow. You should also factor in protocol latencies and various things like directory creation/checks. > > I will test further and try to calibrate the expected speeds on a big file. > > - Mike > > > On 11/6/07 10:19 AM, Michael Wilde wrote: > > > >>> 3. The cluster sizes were extremely small about 4 - should have been > >>> 10-20 by > >>> my calcs. > >> > >> Increase the cluster queue delay parameter from 4 to about 30 > >> (seconds). This will make Swift wait much longer before putting > >> clusters together, which may allow more jobs to build up in the > >> clustering queue. > > > > Previous run had this set to 10 seconds. The logs confirm that this was > > the clustering period: the cluster size=4 message came out every 10 > > seconds. > > > >> Make sure that you havethe cluster maximum time and maxwalltimes for > >> jobs set to sensible values, because large clusters will highlight > >> misconfigurations there. In particular, note that the maximum cluster > >> time in the config file needs to be (less than) half of the > >> maxwalltime permitted for the site you submit to (so if you are > >> allowewd to run 15 minute jobs, set the cluster maximum time to 7*60, > >> for example). > > > > I set cluster max time to 1200 with a maxwalltime of 60 seconds. > > > > I will fiddle with this part with smaller runs till it works. > > > > Likely I have a config issue somewhere, or theres a bug. > > > >> Are you using the PBS provider or GRAM to submit? > > > > GRAM, gt2. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Nov 6 12:20:07 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 6 Nov 2007 18:20:07 +0000 (GMT) Subject: [Swift-devel] Re: angle-1000 second run In-Reply-To: <4730A362.7020406@mcs.anl.gov> References: <47300229.1040801@mcs.anl.gov> <47309402.9000606@mcs.anl.gov> <4730A362.7020406@mcs.anl.gov> Message-ID: hitting the transfer throttle a lot according to this: http://www.ci.uchicago.edu/~benc/log-processing/report-awf6-20071106-1101-yxipkgyg/ On Tue, 6 Nov 2007, Michael Wilde wrote: > It seems that the cluster problem is also due to the slow speed of input data > file stage-in. > > It took 6 minutes to stage in 60 40MB input files to uc-tg > (this is to NFS; I will try GPFS as well). > > So at 10 files per minute, if we check the cluster queue every 30 seconds, > that about 5 jobs per cluster on average, which explains what we're seeing. > > 10 fpm = 400MB/min = 6.5MB/sec. 
Note that Im submitting from the login node > to the same cluster - seems very slow. > > I will test further and try to calibrate the expected speeds on a big file. > > - Mike > > > On 11/6/07 10:19 AM, Michael Wilde wrote: > > > > > > 3. The cluster sizes were extremely small about 4 - should have been > > > > 10-20 by > > > > my calcs. > > > > > > Increase the cluster queue delay parameter from 4 to about 30 (seconds). > > > This will make Swift wait much longer before putting clusters together, > > > which may allow more jobs to build up in the clustering queue. > > > > Previous run had this set to 10 seconds. The logs confirm that this was the > > clustering period: the cluster size=4 message came out every 10 seconds. > > > > > Make sure that you havethe cluster maximum time and maxwalltimes for jobs > > > set to sensible values, because large clusters will highlight > > > misconfigurations there. In particular, note that the maximum cluster time > > > in the config file needs to be (less than) half of the maxwalltime > > > permitted for the site you submit to (so if you are allowewd to run 15 > > > minute jobs, set the cluster maximum time to 7*60, for example). > > > > I set cluster max time to 1200 with a maxwalltime of 60 seconds. > > > > I will fiddle with this part with smaller runs till it works. > > > > Likely I have a config issue somewhere, or theres a bug. > > > > > Are you using the PBS provider or GRAM to submit? > > > > GRAM, gt2. > > > > From hategan at mcs.anl.gov Tue Nov 6 12:37:50 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 06 Nov 2007 12:37:50 -0600 Subject: [Swift-devel] Re: angle-1000 second run In-Reply-To: References: <47300229.1040801@mcs.anl.gov> <47309402.9000606@mcs.anl.gov> <4730A362.7020406@mcs.anl.gov> Message-ID: <1194374270.18379.2.camel@blabla.mcs.anl.gov> So I just spoke to Bill. The errors we see when transfers go up, we should not see them. In the tests they've done a while ago hundreds of parallel transfers on typical machines were not a problem. So we need to isolate the issue. Possible causes: 1. The Java GridFTP client 2. The CI network 3. Problems introduced in the server after the tests above. Mihael On Tue, 2007-11-06 at 18:20 +0000, Ben Clifford wrote: > hitting the transfer throttle a lot according to this: > http://www.ci.uchicago.edu/~benc/log-processing/report-awf6-20071106-1101-yxipkgyg/ > > > On Tue, 6 Nov 2007, Michael Wilde wrote: > > > It seems that the cluster problem is also due to the slow speed of input data > > file stage-in. > > > > It took 6 minutes to stage in 60 40MB input files to uc-tg > > (this is to NFS; I will try GPFS as well). > > > > So at 10 files per minute, if we check the cluster queue every 30 seconds, > > that about 5 jobs per cluster on average, which explains what we're seeing. > > > > 10 fpm = 400MB/min = 6.5MB/sec. Note that Im submitting from the login node > > to the same cluster - seems very slow. > > > > I will test further and try to calibrate the expected speeds on a big file. > > > > - Mike > > > > > > On 11/6/07 10:19 AM, Michael Wilde wrote: > > > > > > > > 3. The cluster sizes were extremely small about 4 - should have been > > > > > 10-20 by > > > > > my calcs. > > > > > > > > Increase the cluster queue delay parameter from 4 to about 30 (seconds). > > > > This will make Swift wait much longer before putting clusters together, > > > > which may allow more jobs to build up in the clustering queue. > > > > > > Previous run had this set to 10 seconds. 
The logs confirm that this was the > > > clustering period: the cluster size=4 message came out every 10 seconds. > > > > > > > Make sure that you havethe cluster maximum time and maxwalltimes for jobs > > > > set to sensible values, because large clusters will highlight > > > > misconfigurations there. In particular, note that the maximum cluster time > > > > in the config file needs to be (less than) half of the maxwalltime > > > > permitted for the site you submit to (so if you are allowewd to run 15 > > > > minute jobs, set the cluster maximum time to 7*60, for example). > > > > > > I set cluster max time to 1200 with a maxwalltime of 60 seconds. > > > > > > I will fiddle with this part with smaller runs till it works. > > > > > > Likely I have a config issue somewhere, or theres a bug. > > > > > > > Are you using the PBS provider or GRAM to submit? > > > > > > GRAM, gt2. > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Nov 6 14:32:07 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 6 Nov 2007 20:32:07 +0000 (GMT) Subject: [Swift-devel] unexpected event state sequences Message-ID: I've seen this a few times in different runs - file trasnfer task going through sequence of states Submitted -> Failed -> Active (my log processing assumes that active isn't a final state...) 2007-11-06 13:04:30,052-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=ur n:0-127-1-1194375466307) setting status to Submitted 2007-11-06 13:04:30,056-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=ur n:0-118-1-1194375466305) setting status to Failed Error communicating with the G ridFTP server 2007-11-06 13:04:30,057-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=ur n:0-127-1-1194375466307) setting status to Failed Error communicating with the G ridFTP server 2007-11-06 13:04:30,321-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=ur n:0-127-1-1194375466307) setting status to Active From wilde at mcs.anl.gov Tue Nov 6 23:36:19 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 23:36:19 -0600 Subject: [Swift-devel] started angle-1000 using ci-san data and ext mapper Message-ID: <47314ED3.7010404@mcs.anl.gov> started this running at 23:07, ~wilde/angle/data. processing the files of spool_1 and spool_2 and naming the outputs accordingly. I spotted several failures to create a kickstart dir in the prior run (which i killed) and at least one such error in this run. Ive gotten about 10 failures so far from PBS aborts; looks like a node is bad again (sent mail). Added a zcat to the app to decompress the uic data. Am using Mihael's ext mapper, and its working great so far. This is what my *entire* mapper looks like: -- #! /bin/sh awk References: <47314ED3.7010404@mcs.anl.gov> Message-ID: On Tue, 6 Nov 2007, Michael Wilde wrote: > Ive gotten about 10 failures so far from PBS aborts; looks like a node is bad > again (sent mail). 713 attempts to run jobs worked, 416 failed. (that's at the execute2 level) Looks like a combination of file transfer failures and job execution failures. 
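Since these runs are the first to use the new ext mapper, a rough sketch of a SwiftScript declaration that uses it is shown below (the script name, its argument and the path are placeholders, not the actual angle setup; the assumption is that exec= names the external mapping program and any other parameters are passed through to it):

type file;

// hypothetical example: map an array of input files via an external program
file pcaps[] <ext; exec="list-spool.sh", dir="/disks/ci-gpfs/angle/spool_1">;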
-- From wilde at mcs.anl.gov Wed Nov 7 08:32:24 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 07 Nov 2007 08:32:24 -0600 Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: References: <47314ED3.7010404@mcs.anl.gov> Message-ID: <4731CC78.6080108@mcs.anl.gov> In IM Ben said: -- Ben Clifford its possible to change swift to retry jobs more than 3 times. i did that with andrew with it up at 10 sometimes jobs were running 5 times or so it doesn't fix broken nodes but it increases chances of workflow completion. -- Sounds good, will try. With this kind of cluster problem, there's little else we can do from outside the cluster. On 11/7/07 8:13 AM, Ben Clifford wrote: > > On Tue, 6 Nov 2007, Michael Wilde wrote: > >> Ive gotten about 10 failures so far from PBS aborts; looks like a node is bad >> again (sent mail). > > 713 attempts to run jobs worked, 416 failed. (that's at the execute2 > level) > > Looks like a combination of file transfer failures and job execution > failures. > From benc at hawaga.org.uk Wed Nov 7 08:36:36 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 7 Nov 2007 14:36:36 +0000 (GMT) Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: <4731CC78.6080108@mcs.anl.gov> References: <47314ED3.7010404@mcs.anl.gov> <4731CC78.6080108@mcs.anl.gov> Message-ID: On Wed, 7 Nov 2007, Michael Wilde wrote: > Sounds good, will try. With this kind of cluster problem, there's little else > we can do from outside the cluster. Did you get stuff working on another site yet? Site selection should cause better sites to be preferred by magic. -- From wilde at mcs.anl.gov Wed Nov 7 08:52:06 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 07 Nov 2007 08:52:06 -0600 Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: References: <47314ED3.7010404@mcs.anl.gov> <4731CC78.6080108@mcs.anl.gov> Message-ID: <4731D116.6070704@mcs.anl.gov> No, have not - am working on it. Let me know if you can help. Is that error retry change ( 3 => 10) must be a patch, as I dont see a property for it? Lets also discuss what options we have for working around the data transfer problem. I'd like to propose two test runs: - data is all local, no gridftp, and data xfer is unthrottled - job throttle wide open, but job delivery rate slowed down to a GT2 happy-level. If we have a central job dispatcher that is aware of where data is from a simple map, want to see if we can then achieve fast runs. I'll be as UC in an hour but want to start a test run first. What should we run next? On 11/7/07 8:36 AM, Ben Clifford wrote: > > On Wed, 7 Nov 2007, Michael Wilde wrote: > >> Sounds good, will try. With this kind of cluster problem, there's little else >> we can do from outside the cluster. > > Did you get stuff working on another site yet? Site selection should cause > better sites to be preferred by magic. > From benc at hawaga.org.uk Wed Nov 7 09:00:45 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 7 Nov 2007 15:00:45 +0000 (GMT) Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: <4731D116.6070704@mcs.anl.gov> References: <47314ED3.7010404@mcs.anl.gov> <4731CC78.6080108@mcs.anl.gov> <4731D116.6070704@mcs.anl.gov> Message-ID: On Wed, 7 Nov 2007, Michael Wilde wrote: > Is that error retry change ( 3 => 10) must be a patch, as I dont see a > property for it? yes. 
though I find myself tweaking it enough recently that I'll add it a a property sometime soon, I think. http://www.ci.uchicago.edu/~benc/andrew-many-retries > - job throttle wide open, but job delivery rate slowed down to a GT2 > happy-level. not sure what that means. > I'll be as UC in an hour but want to start a test run first. What should we > run next? In the absence of a reliable site to run on, not sure what there is to do. -- From wilde at mcs.anl.gov Wed Nov 7 09:31:07 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 07 Nov 2007 09:31:07 -0600 Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: References: <47314ED3.7010404@mcs.anl.gov> <4731CC78.6080108@mcs.anl.gov> <4731D116.6070704@mcs.anl.gov> Message-ID: <4731DA3B.8000308@mcs.anl.gov> >> I'll be as UC in an hour but want to start a test run first. What should we >> run next? > > In the absence of a reliable site to run on, not sure what there is to do. Im going to run a test on just the data stagein problem: same data but to 1 job. That should help separate the throttle problems from the basic data problems. From benc at hawaga.org.uk Wed Nov 7 10:05:25 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 7 Nov 2007 16:05:25 +0000 (GMT) Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: <4731DA3B.8000308@mcs.anl.gov> References: <47314ED3.7010404@mcs.anl.gov> <4731CC78.6080108@mcs.anl.gov> <4731D116.6070704@mcs.anl.gov> <4731DA3B.8000308@mcs.anl.gov> Message-ID: now that you're running with larger compute jobs, the amount of time spent in staging as a proportion of the whole runtime is much less. also, are you running with lazy errors on or off (that is set in the swift properties). I think off. for the purposes of letting runs continue longer, that might be a useful setting to turn on. On Wed, 7 Nov 2007, Michael Wilde wrote: > > > > I'll be as UC in an hour but want to start a test run first. What should > > > we > > > run next? > > > > In the absence of a reliable site to run on, not sure what there is to do. > > Im going to run a test on just the data stagein problem: same data but to 1 > job. > > That should help separate the throttle problems from the basic data problems. > > > From benc at hawaga.org.uk Wed Nov 7 10:08:16 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 7 Nov 2007 16:08:16 +0000 (GMT) Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: References: <47314ED3.7010404@mcs.anl.gov> <4731CC78.6080108@mcs.anl.gov> <4731D116.6070704@mcs.anl.gov> <4731DA3B.8000308@mcs.anl.gov> Message-ID: also cog r1833 has a change in logging that makes log processing work better. that would be good to use in future runs. -- From wilde at mcs.anl.gov Wed Nov 7 12:44:39 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 07 Nov 2007 12:44:39 -0600 Subject: [Swift-devel] Data transfer test results Message-ID: <47320797.9070106@mcs.anl.gov> I ran a swift script that invoked one app with the same 1000 45MB input files that angle-1000 reads. It took an hour to stage in these files from local disk to uc-tg. I think thats way too slow and suggests that we have a problem in the basic data transfer mechanism. 
--
Swift script t1.swift starting at Wed Nov 7 10:39:20 CST 2007
running on sites: UC-nfs-gt2-ks
Swift v0.3-dev r1463
RunID: 20071107-1039-unmh7sed
catls started
catls completed
Swift Script t1.swift ended at Wed Nov 7 11:41:17 CST 2007 with exit code 0
--

The logs are in swift-logs/wilde/run172. This merits analysis. I will re-run this with the data coming from CI SAN gridftp via teraport server.

- Mike

From wilde at mcs.anl.gov Wed Nov 7 12:49:32 2007
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 07 Nov 2007 12:49:32 -0600
Subject: [Swift-devel] Please re-send info on GridFTP servers for CI SAN
Message-ID: <473208BC.3040803@mcs.anl.gov>

Ti, a question for our SC analytics challenge:

Can you re-post the info you sent a while back on the gridftp server on the CI SAN?

We can access the space from the tp-osg gridftp server; when I tried to do so yesterday I ran into errors. I didnt record, because it wasnt clear to me if I was using the right URL to contact it. (I tried stor.ci.uchicago.edu in at least one run). Then I tried stor1. I think stor failed and stor1 hung.

Before I try to capture all this and post here for debugging, can you tell us what the correct URL is to use the server, and any other considerations for striped transfer?

Should stor be a faster server than tp-osg for this data, or same?

can you post a few how-to notes on this to the CI wiki page related to using the SAN?

Thanks,

Mike

From benc at hawaga.org.uk Wed Nov 7 12:58:18 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 7 Nov 2007 18:58:18 +0000 (GMT)
Subject: [Swift-devel] Re: Data transfer test results
In-Reply-To: <47320797.9070106@mcs.anl.gov>
References: <47320797.9070106@mcs.anl.gov>
Message-ID: 

On Wed, 7 Nov 2007, Michael Wilde wrote:

> I ran a swift script that invoked one app with the same 1000 45MB input files
> that angle-1000 reads.

45mb files don't seem particularly representative of the data - I picked spool_190 at random and see that the average file size is 14mb but that many are small, in the 20kb range.

Tuning for 45mb files is likely to not be the same as tuning for 23kb files.

--

From benc at hawaga.org.uk Wed Nov 7 13:17:18 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 7 Nov 2007 19:17:18 +0000 (GMT)
Subject: [Swift-devel] Data transfer test results
In-Reply-To: <47320797.9070106@mcs.anl.gov>
References: <47320797.9070106@mcs.anl.gov>
Message-ID: 

On Wed, 7 Nov 2007, Michael Wilde wrote:

> It took an hour to stage in these files from local disk to uc-tg. I think
> thats way too slow and suggests that we have a problem in the basic data
> transfer mechanism.

Workflow spent whole time pegged at 64 transfers at once maximum.

--

From benc at hawaga.org.uk Wed Nov 7 13:19:41 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 7 Nov 2007 19:19:41 +0000 (GMT)
Subject: [Swift-devel] Data transfer test results
In-Reply-To: 
References: <47320797.9070106@mcs.anl.gov>
Message-ID: 

On Wed, 7 Nov 2007, Ben Clifford wrote:

> > It took an hour to stage in these files from local disk to uc-tg. I think
> > thats way too slow and suggests that we have a problem in the basic data
> > transfer mechanism.
>
> Workflow spent whole time pegged at 64 transfers at once maximum.

and stats on karajan file transfer tasks are:

Total number of events: 1004
Shortest event (s): 0.223000049591064
Longest event (s): 284.994999885559
Total duration of all events (s): 225762.589998722
Mean event duration (s): 224.863137448926

45mb / 224s = 200kb/s which is pretty ass.
that 200kb isn't caused by rate limiting at the karajan scheduler level, though. -- From wilde at mcs.anl.gov Wed Nov 7 13:25:54 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 07 Nov 2007 13:25:54 -0600 Subject: [Swift-devel] Re: Data transfer test results In-Reply-To: References: <47320797.9070106@mcs.anl.gov> Message-ID: <47321142.7030208@mcs.anl.gov> I think in general they are larger, but I will investigate. In the meantime, run173 just finished, staging in data from the tp-osg server. This went very nice: staged in 1000 45MB files in 10 minutes! I think yesterday I measured the disk-to-disk copy time using dd of a 2.5GB file at 2 minutes, so this WAN transfer from CI to Argonne at 10 minutes is only about 2.5X slower. Thats not bad, and 10 minutes to stage the whole dataset is not bad. Lets discuss net how best to achieve or simulate/hack caching of inputs on the local site. Whats the best way to do that and test it? - Mike (btw - run173 above failed in the end, I think, due to long cmd line length. Need to discuss that as well, as we may need to demo a summarization job such as this test simulates). On 11/7/07 12:58 PM, Ben Clifford wrote: > > On Wed, 7 Nov 2007, Michael Wilde wrote: > >> I ran a swift script that invoked one app with the same 1000 45MB input files >> that angle-1000 reads. > > 45mb files don't seem particularly representative of the data - I picked > spool_190 at random and see that the average file size is 14mb but that > many are small, in the 20kb range. > > Tuning for 45mb files is likely to not be the same as tuning for 23kb > files. > From benc at hawaga.org.uk Wed Nov 7 13:31:43 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 7 Nov 2007 19:31:43 +0000 (GMT) Subject: [Swift-devel] Data transfer test results In-Reply-To: References: <47320797.9070106@mcs.anl.gov> Message-ID: in t1, you have no structure to your submit files and you're staging into up to 8 different FTP servers; I could imagine that there's shared filesystem contention there as eight servers pass the lock for the root of the site-side data cache around. You could try two things: i) structure your data more sensibly (eg. hierarchical directories) ii) use only one gridftp server, eg name one of the servers explicitly like tg-s008.uc.teragrid.org rather than using the tg-gridftp name which goes to 8 different servers. -- From leggett at ci.uchicago.edu Wed Nov 7 15:19:06 2007 From: leggett at ci.uchicago.edu (Ti Leggett) Date: Wed, 7 Nov 2007 15:19:06 -0600 Subject: [Swift-devel] Re: Please re-send info on GridFTP servers for CI SAN In-Reply-To: <473208BC.3040803@mcs.anl.gov> References: <473208BC.3040803@mcs.anl.gov> Message-ID: <9D503713-1C9B-4CAC-980B-DF5DA31DFB2B@ci.uchicago.edu> This should be working, there was an error in the gridftp configuration files. On Nov 7, 2007, at 12:49 PM, Michael Wilde wrote: > Ti, a question for our SC analytics challenge: > > Can you re-post the info you sent a while back on the gridftp server > on the CI SAN? > > We can access the space from the tp-osg gridftp server; when I tried > to do so yesterday I ran into errors. I didnt record, because it > wasnt clear to me if I was using the right URL to contact it. (I > tried stor.ci.uchicago.edu in at least one run). Then I tried > stor1. I think stor failed and stor1 hung. > > Before I try to capture all this and post here for debugging, can > you tell us what the correct URL is to use the server, and any other > considerations for striped transfer? 
>
> Should stor be a faster server than tp-osg for this data, or same?
>
> can you post a few how-to notes on this to the CI wiki page related
> to using the SAN?
>
> Thanks,
>
> Mike
>

From benc at hawaga.org.uk Thu Nov 8 09:41:35 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 8 Nov 2007 15:41:35 +0000 (GMT)
Subject: [Swift-devel] job success in the presence of massive brokenness
Message-ID: 

I have been interested to see how, over the past few days, runs with fairly large job failure rates (on the order of 30%) have still been able to complete jobs successfully and (with an appropriately hacked scheduler) not get stuck at an appallingly slow completion rate.

Eventually, when some real investigation happens in the scheduler, putting in artificially broken job/file transfer submissions to see how things perform will be an interesting thing to do.

--

From wilde at mcs.anl.gov Fri Nov 9 07:50:41 2007
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 09 Nov 2007 07:50:41 -0600
Subject: [Swift-devel] timing stats from run194
Message-ID: <473465B1.1080901@mcs.anl.gov>

the attached list of runtimes from angle run194 is interesting - there is quite a variance. one can see how "unlucky" clusters of jobs will hit the time limit.

only thing i can think of is to tweak the limit up.

Would be good to plot:
- displays of runtime range etc
- runtime vs input file size (kickstart can tell us this but it takes coordination with the kickstart caller that i dont think swift does yet)

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: kickstart.summary.byruntime
URL: 

From benc at hawaga.org.uk Fri Nov 9 07:58:26 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 9 Nov 2007 13:58:26 +0000 (GMT)
Subject: [Swift-devel] Re: timing stats from run194
In-Reply-To: <473465B1.1080901@mcs.anl.gov>
References: <473465B1.1080901@mcs.anl.gov>
Message-ID: 

On Fri, 9 Nov 2007, Michael Wilde wrote:

> the attached list of runtimes from angle run194 is interesting - there is
> quite a variance. one can see how "unlucky" clusters of jobs will hit the time
> limit.
>
> only thing i can think of is to tweak the limit up.
>
> Would be good to plot:
> - displays of runtime range etc

That gets plotted already - I just sent you that. The 90s wall time limit is capturing maybe 70% of successful jobs.

http://www.ci.uchicago.edu/~benc/log-processing/report-20071108-2248-okx1odlc/kickstart-duration-histogram.png

--

From benc at hawaga.org.uk Fri Nov 9 08:06:46 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 9 Nov 2007 14:06:46 +0000 (GMT)
Subject: [Swift-devel] Re: timing stats from run194
In-Reply-To: 
References: <473465B1.1080901@mcs.anl.gov>
Message-ID: 

I suspect what may be happening in the most recent run is that a bunch of long jobs are accumulating for retry, having failed earlier due to walltimes, and are now spending forever over and over running out of walltime.

--

From wilde at mcs.anl.gov Fri Nov 9 08:19:19 2007
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 09 Nov 2007 08:19:19 -0600
Subject: [Swift-devel] Re: timing stats from run194
In-Reply-To: 
References: <473465B1.1080901@mcs.anl.gov>
Message-ID: <47346C67.4080501@mcs.anl.gov>

Ah - a perfectly logical explanation, and a hard case to handle with retry. Perhaps the retry mechanism should be taught to recognize over-walltime errors and bump up the walltime for the failures based on per-application settings.
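The per-application knob that exists today is the GLOBUS::maxwalltime profile on each tc.data entry; an illustrative line (site, path and value are examples, not the real catalog; a bare number is taken as minutes, and the 00:05:00 style used later in this thread also works) would be:

UC  angle  /path/to/angle.sh  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime=5;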
On 11/9/07 8:06 AM, Ben Clifford wrote: > I suspect what may be happening in the most recent run is that a bunch of > long jobs are accumulating for retry, having failed earlier due to > walltimes, and are now spending forever over and over running out of > walltime. > From benc at hawaga.org.uk Fri Nov 9 08:24:50 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 9 Nov 2007 14:24:50 +0000 (GMT) Subject: [Swift-devel] Re: timing stats from run194 In-Reply-To: <47346C67.4080501@mcs.anl.gov> References: <473465B1.1080901@mcs.anl.gov> <47346C67.4080501@mcs.anl.gov> Message-ID: On Fri, 9 Nov 2007, Michael Wilde wrote: > Ah - a perfectly logical explanation, and a hard case to handle with retry. > Perhaps the retry mechanism should be taught to recognize over-walltime errors > and bump up the walltime for the failures based on per-application settings. well, that's not really the semantics of maxwalltime - you as the application user assert in your maxwalltime spec that it is an error for your jobs to take longer than that. it is perhaps bad to allow one job breaking that assertion to cause a clusterful of jobs to fail. it may also be more sensible in the case of widely varying loads to specify the clusteriness in terms of jobs-per-cluster rather than the present maxwalltime based approach. exciting application-specific estimation of appropriate maxwalltimes for invocations, rather than for all invocations of an app - based (eg) on input file or other parameters is an option to also investigate in the future. -- From wilde at mcs.anl.gov Fri Nov 9 08:47:26 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 09 Nov 2007 08:47:26 -0600 Subject: [Swift-devel] Emails from LSF from NCSA tungsten: normal or problem? Message-ID: <473472FE.9050709@mcs.anl.gov> I get tens to hundres of these from tungsten. I dont see an error here - are these normal and if so can i prevent the email? If it indicates an error, what is this telling me? (the short runtime at th end is suspcicious but other than that I dont see an error in here) -------- Original Message -------- Subject: Job 1110502: <#! /bin/sh;#;# LSF batch job script built by Globus 4.0.1-r3 Job Manager;#;#BSUB -i /dev/null;#BSUB -o /dev/null;#BSUB -e /dev/null;#BSUB -N;#BSUB -n 1;X509_USER_PROXY=/u/ac/wilde/.globus/job/tund.ncsa.uiuc.edu/21595.1194618548/x509> Done Date: Fri, 9 Nov 2007 08:36:10 -0600 From: LSF To: wilde at ncsa.uiuc.edu Job <#! /bin/sh;#;# LSF batch job script built by Globus 4.0.1-r3 Job Manager;#;#BSUB -i /dev/null;#BSUB -o /dev/null;#BSUB -e /dev/null;#BSUB -N;#BSUB -n 1;X509_USER_PROXY=/u/ac/wilde/.globus/job/tund.ncsa.uiuc.edu/21595.1194618548/x509> was submitted from host by user . Job was executed on host(s) , in queue , as user . was used as the home directory. was used as the working directory. Started at Fri Nov 9 08:35:49 2007 Results reported at Fri Nov 9 08:36:10 2007 Your job looked like: ------------------------------------------------------------ # LSBATCH: User input #! 
/bin/sh # # LSF batch job script built by Globus 4.0.1-r3 Job Manager # #BSUB -i /dev/null #BSUB -o /dev/null #BSUB -e /dev/null #BSUB -N #BSUB -n 1 X509_USER_PROXY=/u/ac/wilde/.globus/job/tund.ncsa.uiuc.edu/21595.1194618548/x509_up; export X509_USER_PROXY GLOBUS_LOCATION=/usr/local/prews-gram-4.0.1-r3/; export GLOBUS_LOCATION GLOBUS_GRAM_JOB_CONTACT=https://tund.ncsa.uiuc.edu:50031/21595/1194618548/; export GLOBUS_GRAM_JOB_CONTACT GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://tund.ncsa.uiuc.edu:50032/; export GLOBUS_GRAM_MYJOB_CONTACT HOME=/u/ac/wilde; export HOME LOGNAME=wilde; export LOGNAME if test 'X${LD_LIBRARY_PATH}' != 'X'; then LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:" else LD_LIBRARY_PATH="" fi export LD_LIBRARY_PATH #Change to directory requested by user cd /scratch/users/wilde/swiftdata/awf8-20071109-0827-it230m61 . /usr/lsf/conf/profile.lsf && /usr/lsf/6.0/linux2.4-glibc2.3-x86/bin/lsgrun -p -m "$LSB_HOSTS" /bin/sh "shared/wrapper.sh" "angle4-ggxk5uji" "-jobdir" "g" "-e" "/u/ac/wilde/angle/bin/angle4.sh" "-out" "stdout.txt" "-err" "stderr.txt" "-i" "-d" "disks/ci-gpfs/angle/spool_1|_output/of/spool_1|_output/cf/spool_1" "-if" "/disks/ci-gpfs/angle/spool_1/ncdm2-1182355200-dump.1.272.pcap.gz" "-of" "_output/of/spool_1/of.ncdm2-1182355200-dump.1.272.angle|_output/cf/spool_1/cf.ncdm2-1182355200-dump.1.272.center" "-k" "/u/ac/wilde/swift/tools/mystart" "-a" "disks/ci-gpfs/angle/spool_1/ncdm2-1182355200-dump.1.272.pcap.gz" "_output/of/spool_1/of.ncdm2-1182355200-dump.1.272.angle" "_output/cf/spool_1/cf.ncdm2-1182355200-dump.1.272.center" 2> /dev/null ------------------------------------------------------------ Successfully completed. Resource usage summary: CPU time : 0.71 sec. Max Memory : 3 MB Max Swap : 6 MB Max Processes : 1 Max Threads : 1 From benc at hawaga.org.uk Fri Nov 9 08:52:39 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 9 Nov 2007 14:52:39 +0000 (GMT) Subject: [Swift-devel] Emails from LSF from NCSA tungsten: normal or problem? In-Reply-To: <473472FE.9050709@mcs.anl.gov> References: <473472FE.9050709@mcs.anl.gov> Message-ID: On Fri, 9 Nov 2007, Michael Wilde wrote: > I get tens to hundres of these from tungsten. > I dont see an error here - are these normal and if so can i prevent the email? > > If it indicates an error, what is this telling me? > > (the short runtime at th end is suspcicious but other than that I dont see an > error in here) It ends with: Successfully completed. so my first guess is that its a success message. Not sure how to make those go away. -- From wilde at mcs.anl.gov Fri Nov 9 09:14:29 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 09 Nov 2007 09:14:29 -0600 Subject: [Swift-devel] Emails from LSF from NCSA tungsten: normal or problem? In-Reply-To: References: <473472FE.9050709@mcs.anl.gov> Message-ID: <47347955.3010306@mcs.anl.gov> what puzzles me is the very short runtime - unless thats the cpu time of the wrapper script, not including the cpu time of its kids. which would be odd since it seems to know the number of processes that were running (if thats what the "Max Processes :6" means.) So I will ignore for now. On 11/9/07 8:52 AM, Ben Clifford wrote: > > On Fri, 9 Nov 2007, Michael Wilde wrote: > >> I get tens to hundres of these from tungsten. >> I dont see an error here - are these normal and if so can i prevent the email? >> >> If it indicates an error, what is this telling me? >> >> (the short runtime at th end is suspcicious but other than that I dont see an >> error in here) > > It ends with: Successfully completed. 
> so my first guess is that its a success message. Not sure how to make > those go away. > From wilde at mcs.anl.gov Tue Nov 13 04:30:48 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 13 Nov 2007 04:30:48 -0600 Subject: [Swift-devel] clustering problem: Message-ID: <47397CD8.505@mcs.anl.gov> I suspect a problem in clustering. I had the following entries in tc.data: UC angle /home/wilde/angle32/bin/angle.multiarch.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20; sdsc angle /users/ux454325/angle/bin/angle.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20; tungsten angle /u/ac/wilde/angle/bin/angle.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20; teraport angle /home/wilde/angle/bin/angle.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20; mercury angle /home/ncsa/wilde/angle/bin/angle.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20; and the following swift.properties: kickstart.always.transfer=true clustering.enabled=true clustering.queue.delay=15 clustering.min.time=12000 throttle.transfers=64 sitedir.keep=true lazy.errors=true -- which when I ran a batch of 100 jobs, caused job manager failures and no jobs started. the server side jobs, inf and status dirs were empty. No jobs would show up in the PBS queue. I found the following in the serve-side gram logs: gram_job_mgr_1000.log:11/13 03:36:04 JM: GT3 extended error message: GRAM_SCRIPT_GT3_FAILURE_MESSAGE:This job will be charged to account: brn (TG-CCR080001) qsub: Illegal attribute or resource value for Resource_List.walltime gram_job_mgr_1000.log:11/13 03:36:04 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17 -- when I changed maxwalltime to "00:05:00" and the properties to: clustering.queue.delay=30 clustering.min.time=1200 throttle.transfers=16 things work, and all 100 jobs finish smoothly. I suspect that something in my previous parameters is causing an invalid walltime to be sent to pbs. Still digging into this but need help. From wilde at mcs.anl.gov Fri Nov 16 01:19:11 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 16 Nov 2007 01:19:11 -0600 Subject: [Swift-devel] pbs jobs lingering in completing state on uc-teragrid Message-ID: <473D446F.9060501@mcs.anl.gov> Hi Help Team, A question for the Argonne TG group: Starting Tue i saw for the first time that my jobs on uc-teragrid seemed to be lingering for a while in the "C" - completing - state. Is this normal, or a new behavior, or just something I didnt notice before. I dont have an exact time on how long they linger, butit seemed unusual. Any thoughts on this, Ti? - Mike From wilde at mcs.anl.gov Fri Nov 16 01:46:43 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 16 Nov 2007 01:46:43 -0600 Subject: [Swift-devel] swift status monitors Message-ID: <473D4AE3.8030005@mcs.anl.gov> Mihael, do you have something that we can use to display run status? I was fiddling with a small curses-based tool to do this ( tail -f run*.log | shredlog | curse ) but Ben reminded me that you have something brewing. From hategan at mcs.anl.gov Fri Nov 16 21:02:02 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 16 Nov 2007 21:02:02 -0600 Subject: [Swift-devel] Re: swift status monitors In-Reply-To: <473D4AE3.8030005@mcs.anl.gov> References: <473D4AE3.8030005@mcs.anl.gov> Message-ID: <1195268522.3714.7.camel@blabla.mcs.anl.gov> On Fri, 2007-11-16 at 01:46 -0600, Michael Wilde wrote: > Mihael, do you have something that we can use to display run status? 
> > I was fiddling with a small curses-based tool to do this > > ( tail -f run*.log | shredlog | curse ) > > but Ben reminded me that you have something brewing. "brewing" is the right word. > From benc at hawaga.org.uk Tue Nov 20 00:04:59 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 20 Nov 2007 06:04:59 +0000 (GMT) Subject: [Swift-devel] playing with array closing. Message-ID: I spent some time at the weekend and today playing with 'the array closing problem'. The array-closing problem is what happens when we combine single-assignment semantics (which say that you will only write a=foo; once for each variable a) with our array assignment semantics (which say that arrays are populated by multiple assignments, a[0]=foo; a[1]=bar;). Below, exhibit A, is a program which does not work in the present trunk implementation - instead it hangs after executing top-level statements R,S,T and before executing statement W. Statement W will not be executed until the array name 'array' is closed, that is, until it is known that there are no further writes to the array. So I prototyped some compile-time dataflow analysis (a bit like the present input marking code that already exists) to see that statements R,S,T write (or potentially write to) 'array' and that no other statements do. Armed with this knowledge, the compiled karajan code is modified so that: i) when datasets are created (using vdl:new) they are labelled with a list of statements that may write to them. ii) those statements are modified so that they notify the appropriate datasets when they have finished. So each statement issues a partial close on the datasets it writes to, and each dataset is aware which partial closes to expect. When a dataset has received partial closes (at runtime) from everything it is expecting (which is determined at compile time), it becomes fully closed. In the example code, that means that statement W's dependency on the array being closed is now satisfied, and so it is executed, and so this workflow ends. Its not so straightforward - for example, statement U writes to the array several times, and we don't want the first write to do the corresponding partial close. So the above processing happens only for statements in the same scope as the declaration. In the case of sub-scopes, such as inside a foreach, partial closes don't happen, but the enclosing statement (foreach in the example below) are treated as a single statement which completes and closes only when the whole loop is finished. I think this is the right approach to pursue for this problem. Also, I think that this implementation could join up with the present dataset marking code (which is used to determine what is an input and what is not), and also be used for better compile time type checking and related things (eg. checking for variables declared multiple times, variables assigned to multiple times when they shouldn't be, ...) ==== EXHIBIT A, being a program which does not work in the present trunk implementation ==== type file; (file f) writefile(int s) { app { echo s stdout=@f; } } (file f) listvals(file array[]) { app { echo @filenames(array) stdout=@f; } } file array[]; (Q) array[0]=writefile(99999); (R) array[1]=writefile(10000); (S) foreach i in [2:5] { (T) array[i]=writefile(i+80); (U) } file out <"out">; (V) out = listvals(array); (W) From hategan at mcs.anl.gov Tue Nov 20 00:42:35 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 20 Nov 2007 00:42:35 -0600 Subject: [Swift-devel] playing with array closing. 
In-Reply-To: References: Message-ID: <1195540956.971.5.camel@blabla.mcs.anl.gov> I'm thinking... It may be ok to deal with array closing lexically instead of in a dataflow way. In other words close the array after the last lexical write (the scoping problem you mention still remains, but seems ok). This may simplify the implementation and have less memory overhead. The downside is that some corner cases may still break (e.g. calling listvals(array) from inside the foreach - though maybe that breaks anyway). On Tue, 2007-11-20 at 06:04 +0000, Ben Clifford wrote: > I spent some time at the weekend and today playing with 'the array closing > problem'. > > The array-closing problem is what happens when we combine > single-assignment semantics (which say that you will only write a=foo; > once for each variable a) with our array assignment semantics (which say > that arrays are populated by multiple assignments, a[0]=foo; a[1]=bar;). > > Below, exhibit A, is a program which does not work in the present > trunk implementation - instead it hangs after executing > top-level statements R,S,T and before executing statement W. > > Statement W will not be executed until the array name 'array' is closed, > that is, until it is known that there are no further writes to the array. > > So I prototyped some compile-time dataflow analysis (a bit like the > present input marking code that already exists) to see that statements > R,S,T write (or potentially write to) 'array' and that no other statements > do. > > Armed with this knowledge, the compiled karajan code is modified so that: > i) when datasets are created (using vdl:new) they are labelled with a > list of statements that may write to them. > ii) those statements are modified so that they notify the appropriate > datasets when they have finished. > > So each statement issues a partial close on the datasets it writes to, and > each dataset is aware which partial closes to expect. > > When a dataset has received partial closes (at runtime) from everything it > is expecting (which is determined at compile time), it becomes fully > closed. > > In the example code, that means that statement W's dependency on the array > being closed is now satisfied, and so it is executed, and so this workflow > ends. > > Its not so straightforward - for example, statement U writes to the array > several times, and we don't want the first write to do the corresponding > partial close. So the above processing happens only for statements in the > same scope as the declaration. In the case of sub-scopes, such as inside a > foreach, partial closes don't happen, but the enclosing statement (foreach > in the example below) are treated as a single statement which completes > and closes only when the whole loop is finished. > > I think this is the right approach to pursue for this problem. > > Also, I think that this implementation could join up with the present > dataset marking code (which is used to determine what is an input and what > is not), and also be used for better compile time type checking and > related things (eg. checking for variables declared multiple times, > variables assigned to multiple times when they shouldn't be, ...) 
> > ==== EXHIBIT A, being a program which does not work in the present trunk > implementation ==== > type file; > > (file f) writefile(int s) { > app { > echo s stdout=@f; > } > } > > > (file f) listvals(file array[]) { > app { > echo @filenames(array) stdout=@f; > } > } > > file array[]; (Q) > > array[0]=writefile(99999); (R) > array[1]=writefile(10000); (S) > > foreach i in [2:5] { (T) > array[i]=writefile(i+80); (U) > } > > file out <"out">; (V) > > out = listvals(array); (W) > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Nov 20 00:54:37 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 20 Nov 2007 06:54:37 +0000 (GMT) Subject: [Swift-devel] playing with array closing. In-Reply-To: <1195540956.971.5.camel@blabla.mcs.anl.gov> References: <1195540956.971.5.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 20 Nov 2007, Mihael Hategan wrote: > It may be ok to deal with array closing lexically instead of in a > dataflow way. In other words close the array after the last lexical > write (the scoping problem you mention still remains, but seems ok). > This may simplify the implementation and have less memory overhead. That's pretty much what this is. Lexical treatment at compile time. But I think there needs to be some runtime join of the various statements because they don't get executed (or rather, don't complete) in lexical order. > The downside is that some corner cases may still break (e.g. calling > listvals(array) from inside the foreach - though maybe that breaks > anyway). That works in what I did if the loop body doesn't also assign to the array. However, it has deadlock problems if the loop body both assigns to the array and reads from it. With more complication at runtime, I think thats rectifiable - rather than partially-closing after the loop entirely finishes, should be possible to track which pieces of the inner loop have run and close after the appropriate statements have been run for each iteration. -- From hategan at mcs.anl.gov Tue Nov 20 09:44:27 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 20 Nov 2007 09:44:27 -0600 Subject: [Swift-devel] playing with array closing. In-Reply-To: References: <1195540956.971.5.camel@blabla.mcs.anl.gov> Message-ID: <1195573468.2022.6.camel@blabla.mcs.anl.gov> On Tue, 2007-11-20 at 06:54 +0000, Ben Clifford wrote: > > On Tue, 20 Nov 2007, Mihael Hategan wrote: > > > It may be ok to deal with array closing lexically instead of in a > > dataflow way. In other words close the array after the last lexical > > write (the scoping problem you mention still remains, but seems ok). > > This may simplify the implementation and have less memory overhead. > > That's pretty much what this is. Lexical treatment at compile time. But I > think there needs to be some runtime join of the various statements > because they don't get executed (or rather, don't complete) in lexical > order. I thought you can reorder them, but it may be difficult if a single statement writes to multiple arrays (such as the foreach). > > > The downside is that some corner cases may still break (e.g. calling > > listvals(array) from inside the foreach - though maybe that breaks > > anyway). > > That works in what I did if the loop body doesn't also assign to the > array. However, it has deadlock problems if the loop body both assigns to > the array and reads from it. 
> > With more complication at runtime, I think thats rectifiable - rather than > partially-closing after the loop entirely finishes, should be possible to > track which pieces of the inner loop have run and close after the > appropriate statements have been run for each iteration. > From benc at hawaga.org.uk Tue Nov 20 10:41:54 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 20 Nov 2007 16:41:54 +0000 (GMT) Subject: [Swift-devel] playing with array closing. In-Reply-To: <1195573468.2022.6.camel@blabla.mcs.anl.gov> References: <1195540956.971.5.camel@blabla.mcs.anl.gov> <1195573468.2022.6.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 20 Nov 2007, Mihael Hategan wrote: > I thought you can reorder them, but it may be difficult if a single > statement writes to multiple arrays (such as the foreach). Source text order is a linear order; it is possible to flatten any DAG into a linear order, but loses some of the information in the DAG. I thought a bit before about trying to reorder dataset delcarations in the compiled code with respect to execution statements, to try to get mapper parameters computed before they are used; I think what happened was that this introduced unnecessary serialisation (though maybe its possible with a suitably large mix of parallel and sequential blocks). -- From bugzilla-daemon at mcs.anl.gov Tue Nov 20 11:04:23 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 11:04:23 -0600 (CST) Subject: [Swift-devel] [Bug 112] New: error reporting in procedure declarations Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=112 Summary: error reporting in procedure declarations Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk CC: swift-devel at ci.uchicago.edu The below code fragment has poor error reporting in r1471 - the procedure declaration is invalid because 'stdout' is a keyword and cannot be used as a variable name. The parser predictor for procedure declarations predicts based on an entire valid procedure declaration being present, so gives a very poor error message. Predictor can be shortened - perhaps to left-bracket token token (messagefile stdout, messagefile b) greeting(string m) { app { echo m stdout=@filename(a) stderr=@filename(b); } } -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From hategan at mcs.anl.gov Tue Nov 20 11:08:03 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 20 Nov 2007 11:08:03 -0600 Subject: [Swift-devel] playing with array closing. In-Reply-To: References: <1195540956.971.5.camel@blabla.mcs.anl.gov> <1195573468.2022.6.camel@blabla.mcs.anl.gov> Message-ID: <1195578483.5392.3.camel@blabla.mcs.anl.gov> On Tue, 2007-11-20 at 16:41 +0000, Ben Clifford wrote: > On Tue, 20 Nov 2007, Mihael Hategan wrote: > > > I thought you can reorder them, but it may be difficult if a single > > statement writes to multiple arrays (such as the foreach). > > Source text order is a linear order; it is possible to flatten any DAG > into a linear order, but loses some of the information in the DAG. 
> > I thought a bit before about trying to reorder dataset delcarations in the > compiled code with respect to execution statements, to try to get mapper > parameters computed before they are used; I think what happened was that > this introduced unnecessary serialisation (though maybe its possible with > a suitably large mix of parallel and sequential blocks). It isn't in the general case. The "most relevant link" is a book, but see http://dx.doi.org/10.1016/0304-3975(94)00272-X for some info. Basically parallel (independent) and sequential (linear) blocks are not sufficient to provide a decomposition of an arbitrary dag. The article talks about graphs with no primitive structures (those that can be decomposed using only seq/par). > From bugzilla-daemon at mcs.anl.gov Tue Nov 20 11:19:03 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 11:19:03 -0600 (CST) Subject: [Swift-devel] [Bug 113] New: restarts broken in r1471 Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=113 Summary: restarts broken in r1471 Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: hategan at mcs.anl.gov ReportedBy: benc at hawaga.org.uk Restarts don't work (at all?) in r1471. Any example restart log might contain: file://localhost/_concurrent/array-8b41edd4-8cd0-4c09-9e15-da87472c860e--array//elt-4/localhost/arrayclosehang-20071120-0915-tb5pd4ga/shared/elt-4 but on restart, execution appears to look for: ... elt-4/localhost/arrayclosehang-20071120-0917-uoh2opve/shared/elt-4 with a different run-id in the working directory name. so all work is done again. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Tue Nov 20 12:42:34 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 12:42:34 -0600 (CST) Subject: [Swift-devel] [Bug 113] restarts broken in r1471 In-Reply-To: Message-ID: <20071120184234.0A38A164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=113 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |DUPLICATE ------- Comment #1 from benc at hawaga.org.uk 2007-11-20 12:42 ------- oops *** This bug has been marked as a duplicate of 107 *** -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Tue Nov 20 12:42:35 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 12:42:35 -0600 (CST) Subject: [Swift-devel] [Bug 107] restarts broken (by generalisation of data file handling) In-Reply-To: Message-ID: <20071120184235.3AD6D16505@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=107 ------- Comment #1 from benc at hawaga.org.uk 2007-11-20 12:42 ------- *** Bug 113 has been marked as a duplicate of this bug. *** -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. 
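Returning to the series-parallel point made earlier in this thread: the standard counterexample is the 'N' shaped DAG, in which c depends only on a while d depends on both a and b. Any arrangement of the four nodes using only sequential and parallel blocks either adds a dependency that is not in the DAG or drops one that is, which is why flattening an arbitrary DAG into seq/par structure loses information. A small illustration in Java; the node names are made up for the example:

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // The "N" DAG: c depends on a; d depends on a and b.
    class NShapedDag {
        static final Map<String, List<String>> DEPENDS_ON =
                new LinkedHashMap<String, List<String>>();
        static {
            DEPENDS_ON.put("a", Collections.<String>emptyList());
            DEPENDS_ON.put("b", Collections.<String>emptyList());
            DEPENDS_ON.put("c", Arrays.asList("a"));
            DEPENDS_ON.put("d", Arrays.asList("a", "b"));
        }
    }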
From bugzilla-daemon at mcs.anl.gov Tue Nov 20 13:07:06 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 13:07:06 -0600 (CST) Subject: [Swift-devel] [Bug 11] nested {} blocks do not cause nested variable scopes In-Reply-To: Message-ID: <20071120190706.362A6164EC@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=11 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |swift-devel at ci.uchicago.edu Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from benc at hawaga.org.uk 2007-11-20 13:07 ------- In r1486, I remove nested compound blocks entirely from the language - they appear to have never been used outside of unit tests. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at mcs.anl.gov Tue Nov 20 13:30:17 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 13:30:17 -0600 (CST) Subject: [Swift-devel] [Bug 39] a poor syntax error In-Reply-To: Message-ID: <20071120193017.CA851164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=39 ------- Comment #1 from benc at hawaga.org.uk 2007-11-20 13:30 ------- the parser is interpreting the > as the greater-than operator in an expression: "econ_prob_list.txt" > results which is syntactically valid, rather than as the termination of the mapper declaration. This makes it get a few tokens further along in parsing than desired in this error reporting case. Use of > for both termination of mapper declaration and as a valid in-declaration token is the root cause here, I think. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at mcs.anl.gov Tue Nov 20 14:48:01 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 14:48:01 -0600 (CST) Subject: [Swift-devel] [Bug 11] nested {} blocks do not cause nested variable scopes In-Reply-To: Message-ID: <20071120204801.F100C164EC@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=11 ------- Comment #3 from hategan at mcs.anl.gov 2007-11-20 14:48 ------- The fact that they were never used outside unit tests, doesn't mean that there is not value to them. On the other hand they may not be worth spending time on. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From benc at hawaga.org.uk Thu Nov 22 13:10:58 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 22 Nov 2007 19:10:58 +0000 (GMT) Subject: [Swift-devel] multiple declarations of variables. Message-ID: At present, the language allows multiple declarations using the same variable name resulting in the second variable shadowing the first. eg: > file i; > file i; (which is fairly obvious) but also: > file foo <"myfile">; > file foo = f(x); I've seen multiple declarations like this confuse a few people in that past. 
I'd like to make variable shadowing illegal - either of the above should result in a compile time error (or at least a warning). -- From hategan at mcs.anl.gov Thu Nov 22 13:54:31 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 22 Nov 2007 13:54:31 -0600 Subject: [Swift-devel] multiple declarations of variables. In-Reply-To: References: Message-ID: <1195761272.30276.0.camel@blabla.mcs.anl.gov> On Thu, 2007-11-22 at 19:10 +0000, Ben Clifford wrote: > At present, the language allows multiple declarations using the same > variable name resulting in the second variable shadowing the first. > > eg: > > > file i; > > file i; > > (which is fairly obvious) > > but also: > > > file foo <"myfile">; > > file foo = f(x); > > I've seen multiple declarations like this confuse a few people in that > past. > > I'd like to make variable shadowing illegal - either of the above should > result in a compile time error (or at least a warning). Error. Even Java and C do that. > From wilde at mcs.anl.gov Fri Nov 23 14:44:50 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 23 Nov 2007 14:44:50 -0600 Subject: [Swift-devel] multiple declarations of variables. In-Reply-To: References: Message-ID: <47473BC2.8030204@mcs.anl.gov> that sounds good. On 11/22/07 1:10 PM, Ben Clifford wrote: > At present, the language allows multiple declarations using the same > variable name resulting in the second variable shadowing the first. > > eg: > >> file i; >> file i; > > (which is fairly obvious) > > but also: > >> file foo <"myfile">; >> file foo = f(x); > > I've seen multiple declarations like this confuse a few people in that > past. > > I'd like to make variable shadowing illegal - either of the above should > result in a compile time error (or at least a warning). > From hategan at mcs.anl.gov Fri Nov 23 15:42:04 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 23 Nov 2007 15:42:04 -0600 Subject: [Swift-devel] SSH support Message-ID: <1195854124.12780.7.camel@blabla.mcs.anl.gov> I've updated the SSH provider in cog to do a few things: - make better use of connections (cache them). SSH has this nifty thing: On one connection you can configure multiple independent channels (OpenSSH servers seem to support up to 10 such channels per connection). With this you get up to 10 independent shells without authenticating again. - access remote filesystems (a file op provider) with SFTP - get default authentication information from a file (~/.ssh/auth.defaults). I attached a sample. I need to document this. I also added a filesystem element in the site catalog, which works in a similar way to the execution element: /homes/hategan/tmp That basically allows Swift to work with SSH. -------------- next part -------------- localhost.type=key localhost.username=mike localhost.key=/home/mike/.ssh/identity localhost.passphrase= plussed.mcs.anl.gov.type=key plussed.mcs.anl.gov.username=hategan plussed.mcs.anl.gov.key=/home/mike/.ssh/identity plussed.mcs.anl.gov.passphrase= From benc at hawaga.org.uk Fri Nov 23 19:07:53 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 24 Nov 2007 01:07:53 +0000 (GMT) Subject: [Swift-devel] SSH support In-Reply-To: <1195854124.12780.7.camel@blabla.mcs.anl.gov> References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> Message-ID: can it use ssh-agent authentication? when I looked at the ssh code a while ago it didn't seem to want to. 
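The auth.defaults sample attached above is plain key=value text, with one group of entries per host name prefix. A minimal loader for that format could be as simple as the following; this is an illustrative sketch using java.util.Properties, not the actual cog provider code:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    // Reads a file of host-prefixed entries such as
    //   plussed.mcs.anl.gov.username=hategan
    // and answers per-host lookups.
    class AuthDefaults {
        private final Properties props = new Properties();

        AuthDefaults(String path) throws IOException {
            FileInputStream in = new FileInputStream(path);
            try {
                props.load(in);
            } finally {
                in.close();
            }
        }

        // e.g. get("localhost", "key") -> "/home/mike/.ssh/identity"
        String get(String host, String attribute) {
            return props.getProperty(host + "." + attribute);
        }
    }

With the sample file, new AuthDefaults(System.getProperty("user.home") + "/.ssh/auth.defaults").get("plussed.mcs.anl.gov", "username") would return "hategan".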
-- From hategan at mcs.anl.gov Fri Nov 23 19:14:40 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 23 Nov 2007 19:14:40 -0600 Subject: [Swift-devel] SSH support In-Reply-To: References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> Message-ID: <1195866880.20322.1.camel@blabla.mcs.anl.gov> On Sat, 2007-11-24 at 01:07 +0000, Ben Clifford wrote: > can it use ssh-agent authentication? There have been long discussions about that. The ssh agent seems to use some UNIX specific mechanisms to interact with ssh, so it's a bit weird from Java. But I never really looked into the issue in sufficient detail. I think I should. > when I looked at the ssh code a while > ago it didn't seem to want to. From benc at hawaga.org.uk Fri Nov 23 19:40:37 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 24 Nov 2007 01:40:37 +0000 (GMT) Subject: [Swift-devel] SSH support In-Reply-To: <1195866880.20322.1.camel@blabla.mcs.anl.gov> References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> <1195866880.20322.1.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 23 Nov 2007, Mihael Hategan wrote: > There have been long discussions about that. The ssh agent seems to use > some UNIX specific mechanisms to interact with ssh, so it's a bit weird > from Java. But I never really looked into the issue in sufficient > detail. I think I should. right, it uses unix domain sockets. I have no idea what that looks like in Java - I think nothing standard at all. I think maybe ssh-agent is also version-specific (i.e. it operates only with the ssh client from the same release as the ssh-agent) so maybe it's a rather forlorn hope. -- From hategan at mcs.anl.gov Fri Nov 23 19:56:52 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 23 Nov 2007 19:56:52 -0600 Subject: [Swift-devel] SSH support In-Reply-To: References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> <1195866880.20322.1.camel@blabla.mcs.anl.gov> Message-ID: <1195869412.22390.3.camel@blabla.mcs.anl.gov> On Sat, 2007-11-24 at 01:40 +0000, Ben Clifford wrote: > On Fri, 23 Nov 2007, Mihael Hategan wrote: > > > There have been long discussions about that. The ssh agent seems to use > > some UNIX specific mechanisms to interact with ssh, so it's a bit weird > > from Java. But I never really looked into the issue in sufficient > > detail. I think I should. > > right, it uses unix domain sockets. I have no idea what that looks like in > Java - I think nothing standard at all. I think maybe ssh-agent is also > version-specific (i.e. it operates only with the ssh client from the same > release as the ssh-agent) so maybe it's a rather forlorn hope. There is a Java implementation, as far as I remember, of it (even in j2ssh). Though I've never tried it. However, there is also GSISSH. Also not sure what it would take to get that to work in the current scheme. On the other hand, user generated key pairs can be very convenient. It would certainly solve the problem of having to generate proxies on a regular basis in a portal, for which it gets an A in usability/convenience.
> From benc at hawaga.org.uk Fri Nov 23 20:01:07 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 24 Nov 2007 02:01:07 +0000 (GMT) Subject: [Swift-devel] SSH support In-Reply-To: <1195869412.22390.3.camel@blabla.mcs.anl.gov> References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> <1195866880.20322.1.camel@blabla.mcs.anl.gov> <1195869412.22390.3.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 23 Nov 2007, Mihael Hategan wrote: > On the other hand, user generated key pairs can be very convenient. It > would certainly solve the problem of having to generate proxies on a > regular basis in a portal, for which it gets an A in > usability/convenience. though if you're prepared to accept long-term unencrypted credentials, making a proxy valid for the full length of its parent credential is also a reasonable way to proceed. -- From hategan at mcs.anl.gov Fri Nov 23 20:15:09 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 23 Nov 2007 20:15:09 -0600 Subject: [Swift-devel] SSH support In-Reply-To: References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> <1195866880.20322.1.camel@blabla.mcs.anl.gov> <1195869412.22390.3.camel@blabla.mcs.anl.gov> Message-ID: <1195870509.22925.10.camel@blabla.mcs.anl.gov> On Sat, 2007-11-24 at 02:01 +0000, Ben Clifford wrote: > > On Fri, 23 Nov 2007, Mihael Hategan wrote: > > > On the other hand, user generated key pairs can be very convenient. It > > would certainly solve the problem of having to generate proxies on a > > regular basis in a portal, for which it gets an A in > > usability/convenience. > > though if you're prepared to accept long-term unencrypted credentials, > making a proxy valid for the full length of its parent credential is also a > reasonable way to proceed. In a sense. One difference is that you can easily create a key pair to be used for a specific application and specific sites, entirely separate from an identity used to gain access to more critical things. It's harder to get "application certs" from CAs that are accepted by services on the typical servers we use. > From hategan at mcs.anl.gov Wed Nov 28 18:20:18 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Nov 2007 18:20:18 -0600 Subject: [Swift-devel] transfers of small files Message-ID: <1196295618.29963.10.camel@blabla.mcs.anl.gov> So I've been playing with that issue. I've made some measurements outside Swift. Here's a summary: 32k files. From terminable to tg-uc 1 - karajan with connection caching. transfers in parallel. tops at 200KB/s 2 - n*globus-url-copy - With 32 parallel transfers it starts failing and gets about 10KB/s 3 - globus-url-copy with a list of files: around 300KB/s 4 - globus-url-copy with a list of files, E mode, and data channel re-use: 500KB/s So I figured I should hack the GridFTP provider to re-use data channels by default. This is where it gets strange. I get averages (over multiple runs) of over 1MB/s, with mins of about 130KB and max of 1.9MB/s, but with a lot of variability. I'll debug this. However, I think there is still value in enabling this by default. From hategan at mcs.anl.gov Wed Nov 28 18:23:21 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Nov 2007 18:23:21 -0600 Subject: Re: [Swift-devel] transfers of small files In-Reply-To: <1196295618.29963.10.camel@blabla.mcs.anl.gov> References: <1196295618.29963.10.camel@blabla.mcs.anl.gov> Message-ID: <1196295801.29963.11.camel@blabla.mcs.anl.gov> By contrast, multiple large files are transferred at a max of 11MB/s in (1).
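Most of the gap between cases (2) and (4) above is per-file setup cost: authenticating a new control connection and negotiating a new data channel for every small file, versus paying for both once and reusing them. The reuse pattern looks roughly like the sketch below; the client interface here is hypothetical and does not correspond to the real JGlobus/org.globus.ftp API:

    import java.util.List;

    // Hypothetical transfer client; the method names are invented for
    // illustration only.
    interface FtpLikeClient {
        void connectAndAuthenticate(String host) throws Exception;
        void setModeE() throws Exception;        // extended block mode
        void transfer(String src, String dest) throws Exception;
        void close() throws Exception;
    }

    class SmallFileCopier {
        // One connection and one data-channel setup, many transfers.
        static void copyAll(FtpLikeClient client, String host,
                            List<String[]> srcDestPairs) throws Exception {
            client.connectAndAuthenticate(host);  // pay this once
            client.setModeE();                    // allow channel reuse
            try {
                for (String[] pair : srcDestPairs) {
                    client.transfer(pair[0], pair[1]);
                }
            } finally {
                client.close();
            }
        }
    }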
On Wed, 2007-11-28 at 18:20 -0600, Mihael Hategan wrote: > So I've been playing with that issue. I've made some measurements > outside Swift. Here's a summary: > > 32k files. From terminable to tg-uc > > 1 - karajan with connection caching. transfers in parallel. tops at > 200KB/s > > 2 - n*globus-url-copy - With 32 parallel transfers it starts failing and > gets about 10KB/s > > 3 - globus-url-copy with a list of files: around 300KB/s > > 4 - globus-url-copy with a list of files, E mode, and data channel > re-use: 500KB/s > > So I figured I should hack the GridFTP provider to re-use data channels > by default. This is where it gets strange. I get averages (over multiple > runs) of over 1MB/s, with mins of about 130KB and max of 1.9MB/s, but > with a lot of variability. I'll debug this. However, I think there is > still value in enabling this by default. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From foster at mcs.anl.gov Wed Nov 28 18:24:54 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Wed, 28 Nov 2007 18:24:54 -0600 Subject: [Swift-devel] transfers of small files In-Reply-To: <1196295618.29963.10.camel@blabla.mcs.anl.gov> References: <1196295618.29963.10.camel@blabla.mcs.anl.gov> Message-ID: <474E06D6.9030601@mcs.anl.gov> Mihael: It isn't clear to me--are you using the "lots of small files" optimization here? I've CCed John Bresnahan so he can comment. Ian. Mihael Hategan wrote: > So I've been playing with that issue. I've made some measurements > outside Swift. Here's a summary: > > 32k files. From terminable to tg-uc > > 1 - karajan with connection caching. transfers in parallel. tops at > 200KB/s > > 2 - n*globus-url-copy - With 32 parallel transfers it starts failing and > gets about 10KB/s > > 3 - globus-url-copy with a list of files: around 300KB/s > > 4 - globus-url-copy with a list of files, E mode, and data channel > re-use: 500KB/s > > So I figured I should hack the GridFTP provider to re-use data channels > by default. This is where it gets strange. I get averages (over multiple > runs) of over 1MB/s, with mins of about 130KB and max of 1.9MB/s, but > with a lot of variability. I'll debug this. However, I think there is > still value in enabling this by default. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From hategan at mcs.anl.gov Wed Nov 28 18:31:58 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Nov 2007 18:31:58 -0600 Subject: [Swift-devel] transfers of small files In-Reply-To: <474E06D6.9030601@mcs.anl.gov> References: <1196295618.29963.10.camel@blabla.mcs.anl.gov> <474E06D6.9030601@mcs.anl.gov> Message-ID: <1196296318.29963.19.camel@blabla.mcs.anl.gov> On Wed, 2007-11-28 at 18:24 -0600, Ian Foster wrote: > Mihael: > > It isn't clear to me--are you using the "lots of small files" > optimization here? It depends what you mean by "lots of small files optimization". Obviously this is an optimization for the lots of small files case. I'm re-using clients with mode E and only sending PASV once per client. 
Let's call this A. There was word of "pipelining". We'll call that B. I assume it to be different from what I did (A) for the following reasons: 1. Jarek had tests for A in JGlobus, so A is not a new deal. 2. Buzz recently committed some code to JGlobus to enable B, which assumes B was not possible before, therefore B != A. > > I've CCed John Bresnahan so he can comment. > > Ian. > > Mihael Hategan wrote: > > So I've been playing with that issue. I've made some measurements > > outside Swift. Here's a summary: > > > > 32k files. From terminable to tg-uc > > > > 1 - karajan with connection caching. transfers in parallel. tops at > > 200KB/s > > > > 2 - n*globus-url-copy - With 32 parallel transfers it starts failing and > > gets about 10KB/s > > > > 3 - globus-url-copy with a list of files: around 300KB/s > > > > 4 - globus-url-copy with a list of files, E mode, and data channel > > re-use: 500KB/s > > > > So I figured I should hack the GridFTP provider to re-use data channels > > by default. This is where it gets strange. I get averages (over multiple > > runs) of over 1MB/s, with mins of about 130KB and max of 1.9MB/s, but > > with a lot of variability. I'll debug this. However, I think there is > > still value in enabling this by default. > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From itf at mcs.anl.gov Wed Nov 28 18:54:50 2007 From: itf at mcs.anl.gov (=?utf-8?B?SWFuIEZvc3Rlcg==?=) Date: Thu, 29 Nov 2007 00:54:50 +0000 Subject: [Swift-devel] transfers of small files In-Reply-To: <1196296318.29963.19.camel@blabla.mcs.anl.gov> References: <1196295618.29963.10.camel@blabla.mcs.anl.gov> <474E06D6.9030601@mcs.anl.gov><1196296318.29963.19.camel@blabla.mcs.anl.gov> Message-ID: <794941865-1196297704-cardhu_decombobulator_blackberry.rim.net-1612396432-@bxe017.bisx.prod.on.blackberry> As mentioned in an email from a few weeks ago, the gridftp guys have implemented support for streaming many small files. I would hope we would try that before implementing our own version. Ian Sent via BlackBerry from T-Mobile -----Original Message----- From: Mihael Hategan Date: Wed, 28 Nov 2007 18:31:58 To:Ian Foster Cc:swift-devel , John Bresnahan Subject: Re: [Swift-devel] transfers of small files On Wed, 2007-11-28 at 18:24 -0600, Ian Foster wrote: > Mihael: > > It isn't clear to me--are you using the "lots of small files" > optimization here? It depends what you mean by "lots of small files optimization". Obviously this is an optimization for the lots of small files case. I'm re-using clients with mode E and only sending PASV once per client. Let's call this A. There was word of "pipelining". We'll call that B. I assume it to be different from what I did (A) for the following reasons: 1. Jarek had tests for A in JGlobus, so A is not a new deal. 2. Buzz recently committed some code to JGlobus to enable B, which assumes B was not possible before, therefore B != A. > > I've CCed John Bresnahan so he can comment. > > Ian. > > Mihael Hategan wrote: > > So I've been playing with that issue. I've made some measurements > > outside Swift. Here's a summary: > > > > 32k files. From terminable to tg-uc > > > > 1 - karajan with connection caching. transfers in parallel. 
tops at > > 200KB/s > > > > 2 - n*globus-url-copy - With 32 parallel transfers it starts failing and > > gets about 10KB/s > > > > 3 - globus-url-copy with a list of files: around 300KB/s > > > > 4 - globus-url-copy with a list of files, E mode, and data channel > > re-use: 500KB/s > > > > So I figured I should hack the GridFTP provider to re-use data channels > > by default. This is where it gets strange. I get averages (over multiple > > runs) of over 1MB/s, with mins of about 130KB and max of 1.9MB/s, but > > with a lot of variability. I'll debug this. However, I think there is > > still value in enabling this by default. > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From bugzilla-daemon at mcs.anl.gov Fri Nov 30 14:03:07 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 30 Nov 2007 14:03:07 -0600 (CST) Subject: [Swift-devel] [Bug 114] New: need to specify run directory name on remote site Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=114 Summary: need to specify run directory name on remote site Product: Swift Version: unspecified Platform: PC OS/Version: Linux Status: NEW Severity: major Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: skenny at uchicago.edu currently the run directory on the remote site is auto-generated by swift. it is important to be able to specify the directory name, especially if it will be used with a portal and/or a community cert so that directory names can include user name. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
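One low-risk way to address this would be to keep the generated timestamp and random suffix (which keep concurrent runs apart) but allow a user-supplied prefix, so a portal running under a community cert can fold the mapped user name into the directory. A sketch of such name generation, illustrative only and not part of Swift:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Random;

    // Builds names like "skenny-myworkflow-20071130-1403-x7k2pq9a":
    // optional user prefix, workflow name, timestamp, random suffix.
    class RunDirNames {
        private static final String ALPHABET =
                "0123456789abcdefghijklmnopqrstuvwxyz";
        private static final Random RANDOM = new Random();

        static String generate(String userPrefix, String workflowName) {
            String stamp =
                    new SimpleDateFormat("yyyyMMdd-HHmm").format(new Date());
            StringBuilder suffix = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                suffix.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
            }
            String base = workflowName + "-" + stamp + "-" + suffix;
            return (userPrefix == null || userPrefix.length() == 0)
                    ? base
                    : userPrefix + "-" + base;
        }
    }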