From benc at hawaga.org.uk Thu Nov 1 05:45:16 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 1 Nov 2007 10:45:16 +0000 (GMT) Subject: [Swift-devel] script mapper In-Reply-To: <1193878744.18796.29.camel@blabla.mcs.anl.gov> References: <1193869309.10145.9.camel@blabla.mcs.anl.gov> <1193876387.18296.5.camel@blabla.mcs.anl.gov> <4729216A.5060805@mcs.anl.gov> <1193878744.18796.29.camel@blabla.mcs.anl.gov> Message-ID:

On Wed, 31 Oct 2007, Mihael Hategan wrote: > > Is this committed/committable? > > Committed. > > So the mapper is called "ext", takes a script via exec=, and then > > arbitrary mapper-specific args?

ext is inconsistent with the convention used for other mapper names. --

From hategan at mcs.anl.gov Thu Nov 1 09:25:04 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Nov 2007 09:25:04 -0500 Subject: [Swift-devel] script mapper In-Reply-To: References: <1193869309.10145.9.camel@blabla.mcs.anl.gov> <1193876387.18296.5.camel@blabla.mcs.anl.gov> <4729216A.5060805@mcs.anl.gov> <1193878744.18796.29.camel@blabla.mcs.anl.gov> Message-ID: <1193927104.30473.1.camel@blabla.mcs.anl.gov>

On Thu, 2007-11-01 at 10:45 +0000, Ben Clifford wrote: > > On Wed, 31 Oct 2007, Mihael Hategan wrote: > > > > Is this committed/committable? > > > > Committed. > > > > So the mapper is called "ext", takes a script via exec=, and then > > > arbitrary mapper-specific args? > > ext is inconsistent with the convention used for other mapper names.

Yes. I think it's silly to have to add _mapper to all the mapper names. Or not? >

From benc at hawaga.org.uk Thu Nov 1 09:26:57 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 1 Nov 2007 14:26:57 +0000 (GMT) Subject: [Swift-devel] script mapper In-Reply-To: <1193927104.30473.1.camel@blabla.mcs.anl.gov> References: <1193869309.10145.9.camel@blabla.mcs.anl.gov> <1193876387.18296.5.camel@blabla.mcs.anl.gov> <4729216A.5060805@mcs.anl.gov> <1193878744.18796.29.camel@blabla.mcs.anl.gov> <1193927104.30473.1.camel@blabla.mcs.anl.gov> Message-ID:

On Thu, 1 Nov 2007, Mihael Hategan wrote: > Yes. I think it's silly to have to add _mapper to all the mapper names. > Or not?

It is silly. It is also silly to have multiple conventions. Perhaps at Christmastime, they can all be renamed again. --

From hategan at mcs.anl.gov Thu Nov 1 09:33:12 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Nov 2007 09:33:12 -0500 Subject: [Swift-devel] script mapper In-Reply-To: References: <1193869309.10145.9.camel@blabla.mcs.anl.gov> <1193876387.18296.5.camel@blabla.mcs.anl.gov> <4729216A.5060805@mcs.anl.gov> <1193878744.18796.29.camel@blabla.mcs.anl.gov> <1193927104.30473.1.camel@blabla.mcs.anl.gov> Message-ID: <1193927592.30473.8.camel@blabla.mcs.anl.gov>

On Thu, 2007-11-01 at 14:26 +0000, Ben Clifford wrote: > > On Thu, 1 Nov 2007, Mihael Hategan wrote: > > > Yes. I think it's silly to have to add _mapper to all the mapper names. > > Or not? > > It is silly. > > It is also silly to have multiple conventions. Right. > > Perhaps at Christmastime, they can all be renamed again. Well, it's a choice. >

From benc at hawaga.org.uk Thu Nov 1 11:57:10 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 1 Nov 2007 16:57:10 +0000 (GMT) Subject: [Swift-devel] ConcurrentMapper changes Message-ID:

I just modified the way that ConcurrentMapper lays out files (r1437). You will likely not have encountered ConcurrentMapper by name. It is used when you do not specify a mapper for a dataset, for example for intermediate variables.
Previously, all files named by this mapper were given a long name in the root directory of the submit and cache directories. When a large number of files were named in this fashion, for example in an array with thousands of elements, this would result in a file for each element and a root directory with thousands of files. Most immediately, I encountered this problem working with Andrew Jamieson running on TeraPort using GPFS. Many hosts attempting to access one directory is severely unscalable on GPFS.

The changes I have made add more structure to filenames generated by the ConcurrentMapper:

1. All files appear in a _concurrent/ subdirectory.

2. Simple/marker-typed files appear directly below _concurrent, named as before. For example:

file outfile;

might give a filename:

_concurrent//outfile-3339612a-08e1-443d-bd14-2329080d2d94-

3. Structures are mapped to a sub-directory, with each element being a file in that subdirectory. For example,

type footype { file left; file right; }
footype structurefile;

might give a directory:

_concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field

containing two files:

_concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/left
_concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/right

4. Array elements are placed in a subdirectory. Within that subdirectory, the index is used to construct a further hierarchy such that there will never be more than 50 directories/files in any one directory. For example:

file manyfile[];

might give mappings like this:

manyfile[0] stored in: _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-0
manyfile[22] stored in: _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-22
manyfile[30] stored in: _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/h5/elt-30
manyfile[734] stored in: _concurrent//manyfile-bcdeedee-4df7-4d21-a207-d8051da3d133--array/h9/h4/elt-734

To form the paths, basically something like this happens: convert each index into base 25; discard the most significant digit; then, starting at the least significant digit and working towards the most significant digit, turn each digit into a subdirectory. For example, 734 in base 10 is (1) (4) (9) in base 25, so we form the intermediate path /h9/h4/.

Doing this means that for large arrays directory paths will grow, whilst for small arrays they will stay short; and the size of the array does not need to be known ahead of time. The constant '25' can easily be adjusted. It's a compiled-in constant defined in one place at the moment, but could be made into a mapper parameter. --

From hategan at mcs.anl.gov Thu Nov 1 12:10:45 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Nov 2007 12:10:45 -0500 Subject: [Swift-devel] a vdc like thing Message-ID: <1193937046.4196.1.camel@blabla.mcs.anl.gov>

http://labs.google.com/papers/chubby.html

I think it at least hints at the dimension of the VDC problem, though I think things can be simplified by assuming only one local VDC. Mihael

From benc at hawaga.org.uk Thu Nov 1 18:23:34 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 1 Nov 2007 23:23:34 +0000 (GMT) Subject: [Swift-devel] karajan scheduler hack Message-ID:

Recently, I've been making runs with Andrew with a scheduler hack to stop karajan's site score going below -10. This has been useful in stopping clustered job failures from causing catastrophic slowdown. It's not clear what, if any, easy change can be made that isn't so hackish to achieve this same benefit. --
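As an aside on the ConcurrentMapper message above: the base-25 path construction it describes can be sketched in a few lines of Java. This is only an illustration of the scheme as described, not the actual ConcurrentMapper source; the class and method names are invented and the path prefix is abbreviated.

    import java.util.ArrayList;
    import java.util.List;

    /** Illustrative sketch of the base-25 directory hashing described above;
        not the actual ConcurrentMapper source. */
    public class ConcurrentPathSketch {

        static final int RADIX = 25; // the compiled-in constant mentioned above

        /** Builds the intermediate "hN/" portion of the path for an array index. */
        static String intermediatePath(int index) {
            // Collect base-RADIX digits, least significant first.
            List<Integer> digits = new ArrayList<Integer>();
            int n = index;
            do {
                digits.add(n % RADIX);
                n /= RADIX;
            } while (n > 0);
            // Discard the most significant digit (the last one collected).
            digits.remove(digits.size() - 1);
            // The least significant remaining digit becomes the outermost directory.
            StringBuilder path = new StringBuilder();
            for (int d : digits) {
                path.append('h').append(d).append('/');
            }
            return path.toString(); // "" for indices below 25, "h9/h4/" for 734
        }

        public static void main(String[] args) {
            String prefix = "_concurrent/manyfile-<uuid>--array/"; // illustrative prefix
            for (int i : new int[] { 0, 22, 30, 734 }) {
                System.out.println(i + " -> " + prefix + intermediatePath(i) + "elt-" + i);
            }
            // Prints .../elt-0, .../elt-22, .../h5/elt-30 and .../h9/h4/elt-734,
            // matching the example mappings above.
        }
    }

Discarding the most significant digit is what keeps small arrays flat: any index below 25 maps straight to elt-N with no intermediate directories, while larger indices gain one directory level per extra base-25 digit.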
From hategan at mcs.anl.gov Thu Nov 1 19:06:43 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 01 Nov 2007 19:06:43 -0500 Subject: [Swift-devel] karajan scheduler hack In-Reply-To: References: Message-ID: <1193962004.10923.11.camel@blabla.mcs.anl.gov>

On Thu, 2007-11-01 at 23:23 +0000, Ben Clifford wrote: > Recently, I've been making runs with Andrew with a scheduler hack to stop > karajan's site score going below -10. > > This has been useful in stopping clustered job failures from causing > catastrophic slowdown. > > It's not clear what, if any, easy change can be made that isn't so hackish to > achieve this same benefit.

I think that would be a reasonable hack for now. However, I do think that the problem is the details of the algorithm not being well thought out. In principle, I think it should remain a feedback system (also given that the opportunistic scheduling for VDS paper pretty much did the same). Given that, and given that with the current assumptions all the feedback inputs are accounted for, I'm led to believe that this is a matter of properly specifying the feedback function. But I'm fuzzy on many things. >

From wilde at mcs.anl.gov Sat Nov 3 18:19:00 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 03 Nov 2007 18:19:00 -0500 Subject: [Swift-devel] Kickstart on Angle vs not? Message-ID: <472D01E4.5010204@mcs.anl.gov>

Ben, what's the current tradeoff on running kickstart for the angle work? When I last checked with you kickstart still goes to one dir and will likely cause contention. I now realize that some of the same data can now be obtained from the wrapper logs. Better to avoid kickstart then, or do you intend to work on it this week? (making no value judgment here - just want your suggestion on most viable route for Angle...)

From itf at mcs.anl.gov Sat Nov 3 18:26:27 2007 From: itf at mcs.anl.gov (Ian Foster) Date: Sat, 3 Nov 2007 23:26:27 +0000 Subject: [Swift-devel] Kickstart on Angle vs not? Message-ID: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry>

We use a different mechanism to retrieve kickstart output vs. log file output, it seems. I'd be interested to understand them.

------Original Message------ From: Mike Wilde Sender: swift-devel-bounces at ci.uchicago.edu To: swift-devel To: Benjamin Clifford Sent: Nov 3, 2007 6:19 PM Subject: [Swift-devel] Kickstart on Angle vs not?

Ben, what's the current tradeoff on running kickstart for the angle work? When I last checked with you kickstart still goes to one dir and will likely cause contention. I now realize that some of the same data can now be obtained from the wrapper logs. Better to avoid kickstart then, or do you intend to work on it this week? (making no value judgment here - just want your suggestion on most viable route for Angle...)

_______________________________________________ Swift-devel mailing list Swift-devel at ci.uchicago.edu http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

Sent via BlackBerry from T-Mobile

From benc at hawaga.org.uk Sat Nov 3 18:53:32 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 3 Nov 2007 23:53:32 +0000 (GMT) Subject: [Swift-devel] Re: Kickstart on Angle vs not? In-Reply-To: <472D01E4.5010204@mcs.anl.gov> References: <472D01E4.5010204@mcs.anl.gov> Message-ID:

On Sat, 3 Nov 2007, Michael Wilde wrote: > When I last checked with you kickstart still goes to one dir and will likely > cause contention.
It doesn't any more - I changed it in the commits yesterday / thursday at the same time I did the other ones. Its not been heavily tested, though. > (making no value judgment here - just want your suggestion on most viable > route for Angle...) kickstart and the info logs provide somewhat different info. For measuring conflict on the shared filesystems, the info logs are probably more useful. -- From benc at hawaga.org.uk Sat Nov 3 18:59:04 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 3 Nov 2007 23:59:04 +0000 (GMT) Subject: [Swift-devel] Kickstart on Angle vs not? In-Reply-To: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry> References: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry> Message-ID: On Sat, 3 Nov 2007, Ian Foster wrote: > we use a different mechanbism to retrieve kickstart ouput vs. Log file > output, it seems. I'd be interested to understand them. kickstart records get sent back to the submit host automatically (subject to various configuration options). wrapper logs never get staged anywhere automatically. -- From wilde at mcs.anl.gov Sat Nov 3 23:34:02 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 03 Nov 2007 23:34:02 -0500 Subject: [Swift-devel] Error in syncing job start with input file availability? Message-ID: <472D4BBA.6080404@mcs.anl.gov> In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like theres a chance that a job attempted to start before its data was visible to the node (not sure, just suspicious). Its a 5-job angle run. 4 jobs worked. The 5th job failed, the one for index 1 of a 5-element input file array. The wf ran with ~wilde/swiftdata/* as the storage and work dir (2 subdirs below that). So it was on NFS. All 5 input files are in the shared/ dir, but the failing job is the one whose timestamp is last. (0, 2,3,4 worked; 1 failed) I also got 3 emails from PBS of the form: PBS Job Id: 1571647.tg-master.uc.teragrid.org Job Name: STDIN Aborted by PBS Server Job cannot be executed See Administrator for help all dated 8:05 PM, three consecutive job ids, *47, 48, 49. Q: Do these email messages indicate that the job was failed by PBS before the app was started, or do these messages indicate a non-zero app exit, eg, if its input file was missing? 
The input files on shared/ were dated: drwxr-xr-x 4 wilde allocate 512 2007-11-03 20:04:33.000000000 -0500 _concurrent/ -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 -0500 pc1.pcap -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:55.000000000 -0500 pc2.pcap -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 -0500 pc3.pcap -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:47.000000000 -0500 pc4.pcap -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:51.000000000 -0500 pc5.pcap -rw-r--r-- 1 wilde allocate 813 2007-11-03 20:04:33.000000000 -0500 seq.sh -rw-r--r-- 1 wilde allocate 4848 2007-11-03 20:04:33.000000000 -0500 wrapper.sh The awf3*.log file shows: 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\ ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ provider=file 2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END jobid=angle4-ujal0lji - Staging in finished 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\ 6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] tmpdir=awf3\ -20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC (Note that the logfile for some reason logs times 1 hour behind???) But the main suspicious thing above is that while the log shows stagin complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod date to be 4:55 past the hour, while the job was started (queued?) at 4:52. If the job happened to hit the PBS queue right at the time PBS was doing a queue poll, it may have started right away, and somehow started before file pc1.pcap was visible to the worker node. Im not sure what if anything in the synchronization prevents this, especially if NFS close-to-open consistency is broken. (Which we are very suspicious of on this site and with Linux NFS in general). Lastly, i've run the identical workflow twice more now, and its worked with no change both times. Any ideas or other explanations for what may have happened here? Also, ideas why the swift log file shows times an hour behind? From wilde at mcs.anl.gov Sat Nov 3 23:44:57 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 03 Nov 2007 23:44:57 -0500 Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472D4BBA.6080404@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> Message-ID: <472D4E49.7080500@mcs.anl.gov> forgot to state: the two identical runs after this that worked are in same log dir, run122 and run123 - mike On 11/3/07 11:34 PM, Michael Wilde wrote: > In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like > theres a chance that a job attempted to start before its data was > visible to the node (not sure, just suspicious). > > Its a 5-job angle run. 4 jobs worked. The 5th job failed, the one for > index 1 of a 5-element input file array. > > The wf ran with ~wilde/swiftdata/* as the storage and work dir (2 > subdirs below that). So it was on NFS. > > All 5 input files are in the shared/ dir, but the failing job is the one > whose timestamp is last. (0, 2,3,4 worked; 1 failed) > > I also got 3 emails from PBS of the form: > > PBS Job Id: 1571647.tg-master.uc.teragrid.org > Job Name: STDIN > Aborted by PBS Server > Job cannot be executed > See Administrator for help > > all dated 8:05 PM, three consecutive job ids, *47, 48, 49. 
> > Q: Do these email messages indicate that the job was failed by PBS > before the app was started, or do these messages indicate a non-zero app > exit, eg, if its input file was missing? > > The input files on shared/ were dated: > > drwxr-xr-x 4 wilde allocate 512 2007-11-03 20:04:33.000000000 > -0500 _concurrent/ > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 > -0500 pc1.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:55.000000000 > -0500 pc2.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 > -0500 pc3.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:47.000000000 > -0500 pc4.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:51.000000000 > -0500 pc5.pcap > -rw-r--r-- 1 wilde allocate 813 2007-11-03 20:04:33.000000000 > -0500 seq.sh > -rw-r--r-- 1 wilde allocate 4848 2007-11-03 20:04:33.000000000 > -0500 wrapper.sh > > The awf3*.log file shows: > > 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END > file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\ > ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ > provider=file > 2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END > jobid=angle4-ujal0lji - Staging in finished > 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START > jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\ > 6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, > _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] > tmpdir=awf3\ > -20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC > > (Note that the logfile for some reason logs times 1 hour behind???) > > But the main suspicious thing above is that while the log shows stagin > complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod > date to be 4:55 past the hour, while the job was started (queued?) at 4:52. > > If the job happened to hit the PBS queue right at the time PBS was doing > a queue poll, it may have started right away, and somehow started before > file pc1.pcap was visible to the worker node. Im not sure what if > anything in the synchronization prevents this, especially if NFS > close-to-open consistency is broken. (Which we are very suspicious of on > this site and with Linux NFS in general). > > Lastly, i've run the identical workflow twice more now, and its worked > with no change both times. > > Any ideas or other explanations for what may have happened here? > > Also, ideas why the swift log file shows times an hour behind? > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Sat Nov 3 23:57:39 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 4 Nov 2007 04:57:39 +0000 (GMT) Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472D4BBA.6080404@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> Message-ID: On Sat, 3 Nov 2007, Michael Wilde wrote: > Also, ideas why the swift log file shows times an hour behind? I think they aren't wrong. The UTC offset is listed as -6. If you're expecting the times to be Chicago local times then they would be an hour different and there would be a -5 UTC offset. Most likely this is caused by an outdated Java that isn't aware of the US federal Energy Policy Act of 2005 (I've encountered at least one such this week) and believes that US daylight savings time ended last week. 
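One quick way to see which daylight-saving rules a particular JVM is applying — an illustrative sketch only, not something used in this thread — is to ask its own TimeZone data for the offset on the date in question:

    import java.util.GregorianCalendar;
    import java.util.TimeZone;

    /** Illustrative check of the JVM's timezone rules; not part of Swift. */
    public class TzCheck {
        public static void main(String[] args) {
            TimeZone chicago = TimeZone.getTimeZone("America/Chicago");
            // Noon local time on 2007-11-03 (Calendar months are zero-based: 10 = November).
            GregorianCalendar cal = new GregorianCalendar(chicago);
            cal.set(2007, 10, 3, 12, 0, 0);
            int offsetHours = chicago.getOffset(cal.getTimeInMillis()) / (60 * 60 * 1000);
            // Post-2005-Act rules: daylight time still in effect on Nov 3, so UTC-5.
            // Outdated timezone data: standard time already, so UTC-6.
            System.out.println("Offset for America/Chicago on 2007-11-03: UTC" + offsetHours);
        }
    }

An up-to-date JRE prints UTC-5 for 2007-11-03; one with pre-2005-Act timezone data prints UTC-6, which matches the offset seen in the log.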
However, as of tomorrow, all will be rectified as Chicago really will be using UTC-6. -- From hategan at mcs.anl.gov Sat Nov 3 23:59:32 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 03 Nov 2007 23:59:32 -0500 Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472D4BBA.6080404@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> Message-ID: <1194152372.10816.2.camel@blabla.mcs.anl.gov> Looks to me like a problem with PBS rather than something with the jobs. So I don't think this is worth investigating. It belongs to the "random bad things happen on occasion" class of problems, for which we have restarts and scoring. Mihael On Sat, 2007-11-03 at 23:34 -0500, Michael Wilde wrote: > In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like > theres a chance that a job attempted to start before its data was > visible to the node (not sure, just suspicious). > > Its a 5-job angle run. 4 jobs worked. The 5th job failed, the one for > index 1 of a 5-element input file array. > > The wf ran with ~wilde/swiftdata/* as the storage and work dir (2 > subdirs below that). So it was on NFS. > > All 5 input files are in the shared/ dir, but the failing job is the one > whose timestamp is last. (0, 2,3,4 worked; 1 failed) > > I also got 3 emails from PBS of the form: > > PBS Job Id: 1571647.tg-master.uc.teragrid.org > Job Name: STDIN > Aborted by PBS Server > Job cannot be executed > See Administrator for help > > all dated 8:05 PM, three consecutive job ids, *47, 48, 49. > > Q: Do these email messages indicate that the job was failed by PBS > before the app was started, or do these messages indicate a non-zero app > exit, eg, if its input file was missing? > > The input files on shared/ were dated: > > drwxr-xr-x 4 wilde allocate 512 2007-11-03 20:04:33.000000000 > -0500 _concurrent/ > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 > -0500 pc1.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:55.000000000 > -0500 pc2.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 > -0500 pc3.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:47.000000000 > -0500 pc4.pcap > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:51.000000000 > -0500 pc5.pcap > -rw-r--r-- 1 wilde allocate 813 2007-11-03 20:04:33.000000000 > -0500 seq.sh > -rw-r--r-- 1 wilde allocate 4848 2007-11-03 20:04:33.000000000 > -0500 wrapper.sh > > The awf3*.log file shows: > > 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END > file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\ > ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ > provider=file > 2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END > jobid=angle4-ujal0lji - Staging in finished > 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START > jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\ > 6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, > _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] > tmpdir=awf3\ > -20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC > > (Note that the logfile for some reason logs times 1 hour behind???) > > But the main suspicious thing above is that while the log shows stagin > complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod > date to be 4:55 past the hour, while the job was started (queued?) at 4:52. 
> > If the job happened to hit the PBS queue right at the time PBS was doing > a queue poll, it may have started right away, and somehow started before > file pc1.pcap was visible to the worker node. Im not sure what if > anything in the synchronization prevents this, especially if NFS > close-to-open consistency is broken. (Which we are very suspicious of on > this site and with Linux NFS in general). > > Lastly, i've run the identical workflow twice more now, and its worked > with no change both times. > > Any ideas or other explanations for what may have happened here? > > Also, ideas why the swift log file shows times an hour behind? > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Sun Nov 4 00:18:29 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 4 Nov 2007 05:18:29 +0000 (GMT) Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472D4BBA.6080404@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> Message-ID: On Sat, 3 Nov 2007, Michael Wilde wrote: > Q: Do these email messages indicate that the job was failed by PBS before the > app was started, or do these messages indicate a non-zero app exit, eg, if its > input file was missing? I don't know what the different PBS errors mean. That job never finished as far as swift is concerned (or at least swift exited before logging anything) - perhaps you're running with lazy errors turned off (which is the default at the moment; I am undecided whether off or on is the best default). > -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 -0500 > pc1.pcap > 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START > jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\ > But the main suspicious thing above is that while the log shows stagin > complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod date to > be 4:55 past the hour, while the job was started (queued?) at 4:52. mod date is 4:52. > If the job happened to hit the PBS queue right at the time PBS was doing a > queue poll, it may have started right away, and somehow started before file > pc1.pcap was visible to the worker node. Im not sure what if anything in the > synchronization prevents this, especially if NFS close-to-open consistency is > broken. (Which we are very suspicious of on this site and with Linux NFS in > general). What site? Can you use a different FS? -- From wilde at mcs.anl.gov Sun Nov 4 07:45:41 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 04 Nov 2007 07:45:41 -0600 Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <1194152372.10816.2.camel@blabla.mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> <1194152372.10816.2.camel@blabla.mcs.anl.gov> Message-ID: <472DCD05.4070609@mcs.anl.gov> On 11/4/07 12:18 AM, Ben Clifford wrote: >> But the main suspicious thing above is that while the log shows stagin >> complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod date to >> be 4:55 past the hour, while the job was started (queued?) at 4:52. > > mod date is 4:52. I got my file and job mixed up. The file was pc2.pcap, the mod date was 4:55, but so was the job start time, so that look ok. My mistake. > > What site? Can you use a different FS? > uc-teragrid. I will experiment with both nfs and gpfs. 
Did you determine with Andrew which is faster? More reliable? On 11/3/07 11:59 PM, Mihael Hategan wrote: > Looks to me like a problem with PBS rather than something with the jobs. > So I don't think this is worth investigating. It belongs to the "random > bad things happen on occasion" class of problems, for which we have > restarts and scoring. Possibly. In this case the job was re-run twice (3 total, within a minute) and all three failed, all got the same PBS error message emailed to me. I agree, not worth investigating unless it happens more. grep JOB_START a*.log | grep pc2 2007-11-03 19:04:55,814-0600 DEBUG vdl:execute2 JOB_START jobid=angle4-tjal0lji tr=angle4 arguments=[pc2.pcap, _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] tmpdir=awf3-20071103-1904-2z266pk3/jobs/t/angle4-tjal0lji host=UC 2007-11-03 19:05:29,495-0600 DEBUG vdl:execute2 JOB_START jobid=angle4-vjal0lji tr=angle4 arguments=[pc2.pcap, _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] tmpdir=awf3-20071103-1904-2z266pk3/jobs/v/angle4-vjal0lji host=UC 2007-11-03 19:05:42,678-0600 DEBUG vdl:execute2 JOB_START jobid=angle4-xjal0lji tr=angle4 arguments=[pc2.pcap, _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] tmpdir=awf3-20071103-1904-2z266pk3/jobs/x/angle4-xjal0lji host=UC vz$ grep EXCEPT a*.log 2007-11-03 19:05:28,567-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-tjal0lji - Application exception: No status file was found. Check the shared filesystem on UC 2007-11-03 19:05:41,754-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-vjal0lji - Application exception: No status file was found. Check the shared filesystem on UC 2007-11-03 19:05:55,048-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-xjal0lji - Application exception: No status file was found. Check the shared filesystem on UC - Mike > > Mihael > > On Sat, 2007-11-03 at 23:34 -0500, Michael Wilde wrote: >> In the angle run in ~benc/swift-logs/wilde/run121, it looks to me like >> theres a chance that a job attempted to start before its data was >> visible to the node (not sure, just suspicious). >> >> Its a 5-job angle run. 4 jobs worked. The 5th job failed, the one for >> index 1 of a 5-element input file array. >> >> The wf ran with ~wilde/swiftdata/* as the storage and work dir (2 >> subdirs below that). So it was on NFS. >> >> All 5 input files are in the shared/ dir, but the failing job is the one >> whose timestamp is last. (0, 2,3,4 worked; 1 failed) >> >> I also got 3 emails from PBS of the form: >> >> PBS Job Id: 1571647.tg-master.uc.teragrid.org >> Job Name: STDIN >> Aborted by PBS Server >> Job cannot be executed >> See Administrator for help >> >> all dated 8:05 PM, three consecutive job ids, *47, 48, 49. >> >> Q: Do these email messages indicate that the job was failed by PBS >> before the app was started, or do these messages indicate a non-zero app >> exit, eg, if its input file was missing? 
>> >> The input files on shared/ were dated: >> >> drwxr-xr-x 4 wilde allocate 512 2007-11-03 20:04:33.000000000 >> -0500 _concurrent/ >> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 >> -0500 pc1.pcap >> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:55.000000000 >> -0500 pc2.pcap >> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:52.000000000 >> -0500 pc3.pcap >> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:47.000000000 >> -0500 pc4.pcap >> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 20:04:51.000000000 >> -0500 pc5.pcap >> -rw-r--r-- 1 wilde allocate 813 2007-11-03 20:04:33.000000000 >> -0500 seq.sh >> -rw-r--r-- 1 wilde allocate 4848 2007-11-03 20:04:33.000000000 >> -0500 wrapper.sh >> >> The awf3*.log file shows: >> >> 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END >> file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\ >> ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ >> provider=file >> 2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END >> jobid=angle4-ujal0lji - Staging in finished >> 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START >> jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\ >> 6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, >> _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] >> tmpdir=awf3\ >> -20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC >> >> (Note that the logfile for some reason logs times 1 hour behind???) >> >> But the main suspicious thing above is that while the log shows stagin >> complete for pc1.pcap at 4:52 past the hour, the ls shows the file mod >> date to be 4:55 past the hour, while the job was started (queued?) at 4:52. >> >> If the job happened to hit the PBS queue right at the time PBS was doing >> a queue poll, it may have started right away, and somehow started before >> file pc1.pcap was visible to the worker node. Im not sure what if >> anything in the synchronization prevents this, especially if NFS >> close-to-open consistency is broken. (Which we are very suspicious of on >> this site and with Linux NFS in general). >> >> Lastly, i've run the identical workflow twice more now, and its worked >> with no change both times. >> >> Any ideas or other explanations for what may have happened here? >> >> Also, ideas why the swift log file shows times an hour behind? >> >> >> >> >> >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > From benc at hawaga.org.uk Sun Nov 4 07:51:53 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 4 Nov 2007 13:51:53 +0000 (GMT) Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472DCD05.4070609@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> <1194152372.10816.2.camel@blabla.mcs.anl.gov> <472DCD05.4070609@mcs.anl.gov> Message-ID: On Sun, 4 Nov 2007, Michael Wilde wrote: > uc-teragrid. I will experiment with both nfs and gpfs. > Did you determine with Andrew which is faster? More reliable? We didn't try NFS so I don't really have any objective data about the two. Though it strikes me i shoul dcheck that the logging is tracking where on the remote site the run directory is placed so that we can tell later on. 
-- From wilde at mcs.anl.gov Sun Nov 4 11:03:53 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 04 Nov 2007 11:03:53 -0600 Subject: [Swift-devel] GT2 service down on uc-teragrid Message-ID: <472DFB79.2000708@mcs.anl.gov> Ti, TG-Help, I'm unable to submit globus jobs to tg-grid via GRAM2: vz$ globus-job-run tg-grid.uc.teragrid.org /bin/hostname GRAM Job submission failed because the connection to the server failed (check host and port) (error code 12) vz$ I can ping the host and telnet can connect to port 2119. It was running OK last night around 10PM. We're using this machine for SC07 tutorials, demos and challenge competitions, so anything you could do to resolve quickly would be appreciated. (Any chance the DST change affected it?) Thanks, Mike -- Michael Wilde Computation Institute University of Chicago and Argonne National Laboratory 5640 S. Ellis Av, Suite 405 Chicago, IL 60637 USA 708-203-9548 From wilde at mcs.anl.gov Sun Nov 4 19:37:37 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 04 Nov 2007 19:37:37 -0600 Subject: [Swift-devel] Kickstart runs on localhost are failing Message-ID: <472E73E1.9030607@mcs.anl.gov> I get job exceptions when I run with kickstart on localhost, regardless of whether clustered or not. The jobs seem to run (3x each) but fail each time. First time gets "Application exception: Missing argument jobdir", 2nd & 3rd get "Application exception: The cache already contains localhost:awf4-20071104-1843-ds8hn11a..." Clustered run is in run137, unclustered in run138 The latter log dir has a file swiftdata.find.out which lists all the files in my data dir (has a local/ branch at the top for localhost jobs). Error in both cases is below. Will try next doing kickstart in both ways via gram. - Mike 2007-11-04 18:47:40,946-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-cgqcmmji - Application exception: Missing argument jobdir for sys:element(rhost, wfdir, jobid, jobdir) 2007-11-04 18:47:41,085-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2-1194223436415) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-cgqcmmji-stderr.txt not found. 2007-11-04 18:47:41,344-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2-1194223436424) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-cgqcmmji-stdout.txt not found. 2007-11-04 18:47:41,503-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-bgqcmmji - Application exception: Missing argument jobdir for sys:element(rhost, wfdir, jobid, jobdir) 2007-11-04 18:47:41,553-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1194223436458) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-bgqcmmji-stderr.txt not found. 2007-11-04 18:47:41,638-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1194223436467) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-bgqcmmji-stdout.txt not found. 2007-11-04 18:47:41,882-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-agqcmmji - Application exception: Missing argument jobdir for sys:element(rhost, wfdir, jobid, jobdir) 2007-11-04 18:47:41,987-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-3-1194223436500) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-agqcmmji-stderr.txt not found. 
2007-11-04 18:47:42,047-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-3-1194223436507) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-agqcmmji-stdout.txt not found. 2007-11-04 18:51:18,439-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-dgqcmmji - Application exception: The cache already contains localhost:awf4-20071104-1843-ds8hn11a/shared/cf0000.angle. 2007-11-04 18:51:18,628-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2-1194223436543) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-dgqcmmji-stderr.txt not found. 2007-11-04 18:51:18,762-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-2-1194223436550) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-dgqcmmji-stdout.txt not found. 2007-11-04 18:51:25,976-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-egqcmmji - Application exception: The cache already contains localhost:awf4-20071104-1843-ds8hn11a/shared/of0002.angle. 2007-11-04 18:51:26,401-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1194223436585) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-egqcmmji-stderr.txt not found. 2007-11-04 18:51:26,726-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1194223436592) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-egqcmmji-stdout.txt not found. 2007-11-04 18:51:28,040-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-fgqcmmji - Application exception: The cache already contains localhost:awf4-20071104-1843-ds8hn11a/shared/cf0001.angle. 2007-11-04 18:51:28,492-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-3-1194223436627) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-fgqcmmji-stderr.txt not found. 2007-11-04 18:51:28,816-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-3-1194223436634) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-fgqcmmji-stdout.txt not found. 2007-11-04 18:54:44,088-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=angle4-hgqcmmji - Application exception: The cache already contains localhost:awf4-20071104-1843-ds8hn11a/shared/of0002.angle. 2007-11-04 18:54:44,440-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1194223436670) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-hgqcmmji-stderr.txt not found. 2007-11-04 18:54:44,652-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, identity=urn:0-1-1194223436677) setting status to Failed org.globus.cog.abstraction.impl.file.FileNotFoundException: angle4-hgqcmmji-stdout.txt not found. 
2007-11-04 18:54:44,741-0600 DEBUG VDL2ExecutionContext Exception in angle4: Exception in angle4: sys:exception @ vdl-int.k, line: 423 at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) 2007-11-04 18:54:46,190-0600 INFO ExecutionContext Detailed exception: Exception in angle4: sys:exception @ vdl-int.k, line: 423 at org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) From benc at hawaga.org.uk Sun Nov 4 19:40:04 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 01:40:04 +0000 (GMT) Subject: [Swift-devel] Kickstart runs on localhost are failing In-Reply-To: <472E73E1.9030607@mcs.anl.gov> References: <472E73E1.9030607@mcs.anl.gov> Message-ID: try r1456 - that has a kickstart record transfer fix. > The jobs seem to run (3x each) but fail each time. First time gets > "Application exception: Missing argument jobdir" r1456 fixes this. > , 2nd & 3rd get "Application > exception: The cache already contains > localhost:awf4-20071104-1843-ds8hn11a..." however, that suggests that there's a cache management problem now that I will investigate. -- From wilde at mcs.anl.gov Sun Nov 4 20:38:22 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 04 Nov 2007 20:38:22 -0600 Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472DCD05.4070609@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> <1194152372.10816.2.camel@blabla.mcs.anl.gov> <472DCD05.4070609@mcs.anl.gov> Message-ID: <472E821E.50109@mcs.anl.gov> A very similar error occured a bit ago, tonight. Its in ~benc/swiftlogs/wilde/run143 along with the same 3 PBS emailed errors in pbs.errors.out. This was using r1453. Upgrading too 1456 now. Just fyi - dont bother with this till we see it with latest release. Also, this run was with kickstart; yesterdays was not. - Mike On 11/4/07 7:45 AM, Michael Wilde wrote: > On 11/4/07 12:18 AM, Ben Clifford wrote: > > >> But the main suspicious thing above is that while the log shows stagin > >> complete for pc1.pcap at 4:52 past the hour, the ls shows the file > mod date to > >> be 4:55 past the hour, while the job was started (queued?) at 4:52. > > > > mod date is 4:52. > > I got my file and job mixed up. The file was pc2.pcap, the mod date was > 4:55, but so was the job start time, so that look ok. My mistake. > > > > > What site? Can you use a different FS? > > > > uc-teragrid. I will experiment with both nfs and gpfs. > Did you determine with Andrew which is faster? More reliable? > > On 11/3/07 11:59 PM, Mihael Hategan wrote: >> Looks to me like a problem with PBS rather than something with the jobs. >> So I don't think this is worth investigating. It belongs to the "random >> bad things happen on occasion" class of problems, for which we have >> restarts and scoring. > > Possibly. In this case the job was re-run twice (3 total, within a > minute) and all three failed, all got the same PBS error message emailed > to me. I agree, not worth investigating unless it happens more. 
> > grep JOB_START a*.log | grep pc2 > > 2007-11-03 19:04:55,814-0600 DEBUG vdl:execute2 JOB_START > jobid=angle4-tjal0lji tr=angle4 arguments=[pc2.pcap, > _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, > _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] > tmpdir=awf3-20071103-1904-2z266pk3/jobs/t/angle4-tjal0lji host=UC > 2007-11-03 19:05:29,495-0600 DEBUG vdl:execute2 JOB_START > jobid=angle4-vjal0lji tr=angle4 arguments=[pc2.pcap, > _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, > _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] > tmpdir=awf3-20071103-1904-2z266pk3/jobs/v/angle4-vjal0lji host=UC > 2007-11-03 19:05:42,678-0600 DEBUG vdl:execute2 JOB_START > jobid=angle4-xjal0lji tr=angle4 arguments=[pc2.pcap, > _concurrent/of-066b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-1, > _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-1] > tmpdir=awf3-20071103-1904-2z266pk3/jobs/x/angle4-xjal0lji host=UC > vz$ > > grep EXCEPT a*.log > > 2007-11-03 19:05:28,567-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-tjal0lji - Application exception: No status file was found. > Check the shared filesystem on UC > 2007-11-03 19:05:41,754-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-vjal0lji - Application exception: No status file was found. > Check the shared filesystem on UC > 2007-11-03 19:05:55,048-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-xjal0lji - Application exception: No status file was found. > Check the shared filesystem on UC > > - Mike > >> >> Mihael >> >> On Sat, 2007-11-03 at 23:34 -0500, Michael Wilde wrote: >>> In the angle run in ~benc/swift-logs/wilde/run121, it looks to me >>> like theres a chance that a job attempted to start before its data >>> was visible to the node (not sure, just suspicious). >>> >>> Its a 5-job angle run. 4 jobs worked. The 5th job failed, the one >>> for index 1 of a 5-element input file array. >>> >>> The wf ran with ~wilde/swiftdata/* as the storage and work dir (2 >>> subdirs below that). So it was on NFS. >>> >>> All 5 input files are in the shared/ dir, but the failing job is the >>> one whose timestamp is last. (0, 2,3,4 worked; 1 failed) >>> >>> I also got 3 emails from PBS of the form: >>> >>> PBS Job Id: 1571647.tg-master.uc.teragrid.org >>> Job Name: STDIN >>> Aborted by PBS Server >>> Job cannot be executed >>> See Administrator for help >>> >>> all dated 8:05 PM, three consecutive job ids, *47, 48, 49. >>> >>> Q: Do these email messages indicate that the job was failed by PBS >>> before the app was started, or do these messages indicate a non-zero >>> app exit, eg, if its input file was missing? 
>>> >>> The input files on shared/ were dated: >>> >>> drwxr-xr-x 4 wilde allocate 512 2007-11-03 >>> 20:04:33.000000000 -0500 _concurrent/ >>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 >>> 20:04:52.000000000 -0500 pc1.pcap >>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 >>> 20:04:55.000000000 -0500 pc2.pcap >>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 >>> 20:04:52.000000000 -0500 pc3.pcap >>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 >>> 20:04:47.000000000 -0500 pc4.pcap >>> -rw-r--r-- 1 wilde allocate 46747037 2007-11-03 >>> 20:04:51.000000000 -0500 pc5.pcap >>> -rw-r--r-- 1 wilde allocate 813 2007-11-03 >>> 20:04:33.000000000 -0500 seq.sh >>> -rw-r--r-- 1 wilde allocate 4848 2007-11-03 >>> 20:04:33.000000000 -0500 wrapper.sh >>> >>> The awf3*.log file shows: >>> >>> 2007-11-03 19:04:52,400-0600 DEBUG vdl:dostagein FILE_STAGE_IN_END >>> file=file://localhost/pc1.pcap srchost=localhost srcdir= srcn\ >>> ame=pc1.pcap desthost=UC destdir=awf3-20071103-1904-2z266pk3/shared/ >>> provider=file >>> 2007-11-03 19:04:52,400-0600 INFO vdl:dostagein END >>> jobid=angle4-ujal0lji - Staging in finished >>> 2007-11-03 19:04:52,401-0600 DEBUG vdl:execute2 JOB_START >>> jobid=angle4-ujal0lji tr=angle4 arguments=[pc1.pcap, _concurrent/of-06\ >>> 6b25e3-b85f-45ce-a674-fd295fe1feb4--array//elt-0, >>> _concurrent/cf-6d786027-4199-47d5-897f-12df44978d24--array//elt-0] >>> tmpdir=awf3\ >>> -20071103-1904-2z266pk3/jobs/u/angle4-ujal0lji host=UC >>> >>> (Note that the logfile for some reason logs times 1 hour behind???) >>> >>> But the main suspicious thing above is that while the log shows >>> stagin complete for pc1.pcap at 4:52 past the hour, the ls shows the >>> file mod date to be 4:55 past the hour, while the job was started >>> (queued?) at 4:52. >>> >>> If the job happened to hit the PBS queue right at the time PBS was >>> doing a queue poll, it may have started right away, and somehow >>> started before file pc1.pcap was visible to the worker node. Im not >>> sure what if anything in the synchronization prevents this, >>> especially if NFS close-to-open consistency is broken. (Which we are >>> very suspicious of on this site and with Linux NFS in general). >>> >>> Lastly, i've run the identical workflow twice more now, and its >>> worked with no change both times. >>> >>> Any ideas or other explanations for what may have happened here? >>> >>> Also, ideas why the swift log file shows times an hour behind? >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Sun Nov 4 20:53:02 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 02:53:02 +0000 (GMT) Subject: [Swift-devel] Error in syncing job start with input file availability? In-Reply-To: <472E821E.50109@mcs.anl.gov> References: <472D4BBA.6080404@mcs.anl.gov> <1194152372.10816.2.camel@blabla.mcs.anl.gov> <472DCD05.4070609@mcs.anl.gov> <472E821E.50109@mcs.anl.gov> Message-ID: On Sun, 4 Nov 2007, Michael Wilde wrote: > A very similar error occured a bit ago, tonight. > Its in ~benc/swiftlogs/wilde/run143 not that I can see... benc at terminable:~/swift-logs !1019 $ find . 
-name run\*143 benc at terminable:~/swift-logs !1020 $ -- From hategan at mcs.anl.gov Sun Nov 4 21:07:50 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 04 Nov 2007 21:07:50 -0600 Subject: [Swift-devel] Kickstart runs on localhost are failing In-Reply-To: <472E73E1.9030607@mcs.anl.gov> References: <472E73E1.9030607@mcs.anl.gov> Message-ID: <1194232070.6373.1.camel@blabla.mcs.anl.gov> On Sun, 2007-11-04 at 19:37 -0600, Michael Wilde wrote: > I get job exceptions when I run with kickstart on localhost, > regardless of whether clustered or not. > > The jobs seem to run (3x each) but fail each time. First time gets > "Application exception: Missing argument jobdir", 2nd & 3rd get > "Application exception: The cache already contains > localhost:awf4-20071104-1843-ds8hn11a..." That probably shouldn't happen unless you're trying to assign to the same variable twice. Does this work without kickstart? > > Clustered run is in run137, unclustered in run138 > The latter log dir has a file swiftdata.find.out which lists all the > files in my data dir (has a local/ branch at the top for localhost jobs). > > Error in both cases is below. > > Will try next doing kickstart in both ways via gram. > > - Mike > > 2007-11-04 18:47:40,946-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-cgqcmmji - Application exception: Missing argument jobdir > for sys:element(rhost, wfdir, jobid, jobdir) > 2007-11-04 18:47:41,085-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-2-1194223436415) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-cgqcmmji-stderr.txt not found. > 2007-11-04 18:47:41,344-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-2-1194223436424) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-cgqcmmji-stdout.txt not found. > 2007-11-04 18:47:41,503-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-bgqcmmji - Application exception: Missing argument jobdir > for sys:element(rhost, wfdir, jobid, jobdir) > 2007-11-04 18:47:41,553-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1-1194223436458) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-bgqcmmji-stderr.txt not found. > 2007-11-04 18:47:41,638-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1-1194223436467) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-bgqcmmji-stdout.txt not found. > 2007-11-04 18:47:41,882-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-agqcmmji - Application exception: Missing argument jobdir > for sys:element(rhost, wfdir, jobid, jobdir) > 2007-11-04 18:47:41,987-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-3-1194223436500) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-agqcmmji-stderr.txt not found. > 2007-11-04 18:47:42,047-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-3-1194223436507) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-agqcmmji-stdout.txt not found. > 2007-11-04 18:51:18,439-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-dgqcmmji - Application exception: The cache already > contains localhost:awf4-20071104-1843-ds8hn11a/shared/cf0000.angle. 
> 2007-11-04 18:51:18,628-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-2-1194223436543) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-dgqcmmji-stderr.txt not found. > 2007-11-04 18:51:18,762-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-2-1194223436550) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-dgqcmmji-stdout.txt not found. > 2007-11-04 18:51:25,976-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-egqcmmji - Application exception: The cache already > contains localhost:awf4-20071104-1843-ds8hn11a/shared/of0002.angle. > 2007-11-04 18:51:26,401-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1-1194223436585) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-egqcmmji-stderr.txt not found. > 2007-11-04 18:51:26,726-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1-1194223436592) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-egqcmmji-stdout.txt not found. > 2007-11-04 18:51:28,040-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-fgqcmmji - Application exception: The cache already > contains localhost:awf4-20071104-1843-ds8hn11a/shared/cf0001.angle. > 2007-11-04 18:51:28,492-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-3-1194223436627) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-fgqcmmji-stderr.txt not found. > 2007-11-04 18:51:28,816-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-3-1194223436634) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-fgqcmmji-stdout.txt not found. > 2007-11-04 18:54:44,088-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=angle4-hgqcmmji - Application exception: The cache already > contains localhost:awf4-20071104-1843-ds8hn11a/shared/of0002.angle. > 2007-11-04 18:54:44,440-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1-1194223436670) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-hgqcmmji-stderr.txt not found. > 2007-11-04 18:54:44,652-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > identity=urn:0-1-1194223436677) setting status to Failed > org.globus.cog.abstraction.impl.file.FileNotFoundException: > angle4-hgqcmmji-stdout.txt not found. 
> 2007-11-04 18:54:44,741-0600 DEBUG VDL2ExecutionContext Exception in angle4: > Exception in angle4: > sys:exception @ vdl-int.k, line: 423 > at > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > 2007-11-04 18:54:46,190-0600 INFO ExecutionContext Detailed exception: > Exception in angle4: > sys:exception @ vdl-int.k, line: 423 > at > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From hategan at mcs.anl.gov Sun Nov 4 21:15:39 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 04 Nov 2007 21:15:39 -0600 Subject: [Swift-devel] Kickstart runs on localhost are failing In-Reply-To: <1194232070.6373.1.camel@blabla.mcs.anl.gov> References: <472E73E1.9030607@mcs.anl.gov> <1194232070.6373.1.camel@blabla.mcs.anl.gov> Message-ID: <1194232539.6373.3.camel@blabla.mcs.anl.gov> On Sun, 2007-11-04 at 21:07 -0600, Mihael Hategan wrote: > On Sun, 2007-11-04 at 19:37 -0600, Michael Wilde wrote: > > I get job exceptions when I run with kickstart on localhost, > > regardless of whether clustered or not. > > > > The jobs seem to run (3x each) but fail each time. First time gets > > "Application exception: Missing argument jobdir", 2nd & 3rd get > > "Application exception: The cache already contains > > localhost:awf4-20071104-1843-ds8hn11a..." > > That probably shouldn't happen unless you're trying to assign to the > same variable twice. Does this work without kickstart? Where "shouldn't" should be interpreted as "unless there's a bug", which isn't necessarily unlikely. > > > > > Clustered run is in run137, unclustered in run138 > > The latter log dir has a file swiftdata.find.out which lists all the > > files in my data dir (has a local/ branch at the top for localhost jobs). > > > > Error in both cases is below. > > > > Will try next doing kickstart in both ways via gram. > > > > - Mike > > > > 2007-11-04 18:47:40,946-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > > jobid=angle4-cgqcmmji - Application exception: Missing argument jobdir > > for sys:element(rhost, wfdir, jobid, jobdir) > > 2007-11-04 18:47:41,085-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > > identity=urn:0-2-1194223436415) setting status to Failed > > org.globus.cog.abstraction.impl.file.FileNotFoundException: > > angle4-cgqcmmji-stderr.txt not found. > > 2007-11-04 18:47:41,344-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > > identity=urn:0-2-1194223436424) setting status to Failed > > org.globus.cog.abstraction.impl.file.FileNotFoundException: > > angle4-cgqcmmji-stdout.txt not found. > > 2007-11-04 18:47:41,503-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > > jobid=angle4-bgqcmmji - Application exception: Missing argument jobdir > > for sys:element(rhost, wfdir, jobid, jobdir) > > 2007-11-04 18:47:41,553-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > > identity=urn:0-1-1194223436458) setting status to Failed > > org.globus.cog.abstraction.impl.file.FileNotFoundException: > > angle4-bgqcmmji-stderr.txt not found. > > 2007-11-04 18:47:41,638-0600 DEBUG TaskImpl Task(type=FILE_OPERATION, > > identity=urn:0-1-1194223436467) setting status to Failed > > org.globus.cog.abstraction.impl.file.FileNotFoundException: > > angle4-bgqcmmji-stdout.txt not found. 
> > 2007-11-04 18:54:44,741-0600 DEBUG VDL2ExecutionContext Exception in angle4: > > Exception in angle4: > > sys:exception @ vdl-int.k, line: 423 > > at > > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > > 2007-11-04 18:54:46,190-0600 INFO ExecutionContext Detailed exception: > > Exception in angle4: > > sys:exception @ vdl-int.k, line: 423 > > at > > org.globus.cog.karajan.workflow.nodes.functions.KException.function(KException.java:29) > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From wilde at mcs.anl.gov Sun Nov 4 21:20:40 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 04 Nov 2007 21:20:40 -0600 Subject: [Swift-devel] Jobs being aborted by PBS server on tg-grid.uc.teragrid.org Message-ID: <472E8C08.308@mcs.anl.gov> Im starting to see more frequent problems like this. Happened once last night to 3 consecutive jobs, and tonight happened twice, to 6 jobs. Ti, could you look in the PBS logs, possibly on the related node(s) and see if its looking like a problem on tg-uc or on our side? Thanks, Mike 11/3 8:05 PM - 3 failures Job IDs 1571647, 48, & 49 11/4 7:46 PM - 3 failures Job IDs 1572031, 33, & 34 11/4 8:56 - 8:57 PM 1572040, 42, 43 All errors have the format below. Swift retries failing jobs 3 times, hence the groups of 3 above. -------- Original Message -------- Subject: PBS JOB 1572043.tg-master.uc.teragrid.org Date: Sun, 4 Nov 2007 20:57:11 -0600 (CST) From: adm at tg-master.uc.teragrid.org (root) To: wilde at tg-grid1.uc.teragrid.org PBS Job Id: 1572043.tg-master.uc.teragrid.org Job Name: STDIN Aborted by PBS Server Job cannot be executed See Administrator for help From wilde at mcs.anl.gov Sun Nov 4 21:26:17 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 04 Nov 2007 21:26:17 -0600 Subject: [Swift-devel] Kickstart runs on localhost are failing In-Reply-To: <1194232070.6373.1.camel@blabla.mcs.anl.gov> References: <472E73E1.9030607@mcs.anl.gov> <1194232070.6373.1.camel@blabla.mcs.anl.gov> Message-ID: <472E8D59.8020206@mcs.anl.gov> [resending to cc swift-devel] On 11/4/07 9:07 PM, Mihael Hategan wrote: > On Sun, 2007-11-04 at 19:37 -0600, Michael Wilde wrote: >> I get job exceptions when I run with kickstart on localhost, >> regardless of whether clustered or not. >> >> The jobs seem to run (3x each) but fail each time. First time gets >> "Application exception: Missing argument jobdir", 2nd & 3rd get >> "Application exception: The cache already contains >> localhost:awf4-20071104-1843-ds8hn11a..." > > That probably shouldn't happen unless you're trying to assign to the > same variable twice. Does this work without kickstart? Yes, it works without kickstart (r1453) Trying again on r1456. It looked to me like the "cache already contains" error was a result of the first failure (which Ben thinks he's fixed in 1456 if I understand right) leaving the cache in a state where the retry gets confused. I should note that in all these cases, I got all the output, so the job runs despite the first error, likely causing the duplicate cache entry problems. 
- Mike
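[Editor's note: below is a minimal, hypothetical sketch of the failure mode Mike describes above. The class and method names are invented and are not the actual Swift/Karajan cache API; the point is only that if a per-site cache rejects duplicate host:path entries, a retry of a job whose outputs were already registered by a partially failed first attempt will trip over the stale entries and report "The cache already contains ...".]

import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-site file cache keyed by host:path.
public class FileCacheSketch {
    private final Map<String, String> entries = new HashMap<String, String>();

    // Registers a staged-out file; rejects duplicates.
    public synchronized void add(String host, String path) {
        String key = host + ":" + path;
        if (entries.containsKey(key)) {
            throw new IllegalStateException("The cache already contains " + key);
        }
        entries.put(key, path);
    }

    public static void main(String[] args) {
        FileCacheSketch cache = new FileCacheSketch();
        // First attempt: the application succeeds and its output is cached,
        // but a later step (imagine the kickstart record fetch) fails and the
        // whole job attempt is marked as failed.
        cache.add("localhost", "awf4-20071104-1843-ds8hn11a/shared/cf0000.angle");
        // Retry: the same output is registered again and collides with the
        // entry left behind by the first attempt.
        try {
            cache.add("localhost", "awf4-20071104-1843-ds8hn11a/shared/cf0000.angle");
        } catch (IllegalStateException e) {
            System.out.println("Application exception: " + e.getMessage());
        }
    }
}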
From hategan at mcs.anl.gov Sun Nov 4 21:32:04 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 04 Nov 2007 21:32:04 -0600
Subject: [Swift-devel] Kickstart runs on localhost are failing
In-Reply-To: <472E8D59.8020206@mcs.anl.gov>
References: <472E73E1.9030607@mcs.anl.gov> <1194232070.6373.1.camel@blabla.mcs.anl.gov> <472E8D59.8020206@mcs.anl.gov>
Message-ID: <1194233524.7242.3.camel@blabla.mcs.anl.gov>

On Sun, 2007-11-04 at 21:26 -0600, Michael Wilde wrote:
> [resending to cc swift-devel]
>
> On 11/4/07 9:07 PM, Mihael Hategan wrote:
> > On Sun, 2007-11-04 at 19:37 -0600, Michael Wilde wrote:
> >> I get job exceptions when I run with kickstart on localhost,
> >> regardless of whether clustered or not.
> >>
> >> The jobs seem to run (3x each) but fail each time. First time gets
> >> "Application exception: Missing argument jobdir", 2nd & 3rd get
> >> "Application exception: The cache already contains
> >> localhost:awf4-20071104-1843-ds8hn11a..."
> >
> > That probably shouldn't happen unless you're trying to assign to the
> > same variable twice. Does this work without kickstart?
>
> Yes, it works without kickstart (r1453)
> Trying again on r1456.
>
> It looked to me like the "cache already contains" error was a result of
> the first failure (which Ben thinks he's fixed in 1456 if I understand
> right) leaving the cache in a state where the retry gets confused.

I thought I made sure in some r that things are added to the cache
transactionally (i.e. when it's known that no bad things can happen).
Maybe I got something wrong.

>
> I should note that in all these cases, I got all the output, so the job
> runs despite the first error, likely causing the duplicate cache entry
> problems.

Ah, I see. The failure occurs when dealing with kickstart which is after
the files are added to the cache. I did get something wrong.
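[Editor's note: a sketch of the "transactional" ordering being discussed here: add the stage-out files to the cache only after every step that can still fail, including the optional kickstart-record transfer, has completed. This is hypothetical Java for illustration, not the actual execute2/vdl-int.k logic; all names below are invented.]

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: if cache registration is the last action in the
// post-job sequence, a failure in an optional step leaves no stale
// entries behind, and the retry starts from a clean cache.
public class PostJobOrderingSketch {
    interface Step { void run() throws Exception; }

    static void postJob(List<Step> steps, Step registerOutputsInCache) throws Exception {
        for (Step s : steps) {
            s.run();                    // stage out files, fetch kickstart record, ...
        }
        registerOutputsInCache.run();   // reached only if nothing above failed
    }

    public static void main(String[] args) {
        List<Step> steps = new ArrayList<Step>();
        steps.add(new Step() { public void run() { System.out.println("stage out of0002.angle"); } });
        steps.add(new Step() { public void run() throws Exception {
            throw new Exception("kickstart record not found");   // the optional step fails
        } });
        try {
            postJob(steps, new Step() { public void run() { System.out.println("add outputs to site cache"); } });
        } catch (Exception e) {
            // The attempt fails, but the cache was never touched, so the retry
            // does not hit "The cache already contains ...".
            System.out.println("attempt failed: " + e.getMessage());
        }
    }
}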
From hategan at mcs.anl.gov Sun Nov 4 21:37:17 2007
From: hategan at mcs.anl.gov (Mihael Hategan)
Date: Sun, 04 Nov 2007 21:37:17 -0600
Subject: [Swift-devel] Kickstart runs on localhost are failing
In-Reply-To: <1194233524.7242.3.camel@blabla.mcs.anl.gov>
References: <472E73E1.9030607@mcs.anl.gov> <1194232070.6373.1.camel@blabla.mcs.anl.gov> <472E8D59.8020206@mcs.anl.gov> <1194233524.7242.3.camel@blabla.mcs.anl.gov>
Message-ID: <1194233838.7242.8.camel@blabla.mcs.anl.gov>

> >
> > It looked to me like the "cache already contains" error was a result of
> > the first failure (which Ben thinks he's fixed in 1456 if I understand
> > right) leaving the cache in a state where the retry gets confused.
>
> I thought I made sure in some r that things are added to the cache
> transactionally (i.e. when it's known that no bad things can happen).
> Maybe I got something wrong.
>
> >
> > I should note that in all these cases, I got all the output, so the job
> > runs despite the first error, likely causing the duplicate cache entry
> > problems.
>
> Ah, I see. The failure occurs when dealing with kickstart which is after
> the files are added to the cache. I did get something wrong.

One solution would be to make kickstart transfer failure warnings
instead of them being thrown as exceptions (easy).
The other would be to only add the stageout files to the cache as the
last thing in the execute2 big try block. (very slightly harder).

Let me know which one you want.

Mihael
From wilde at mcs.anl.gov Sun Nov 4 21:38:44 2007
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 04 Nov 2007 21:38:44 -0600
Subject: [Swift-devel] Jobs being aborted by PBS server on tg-grid.uc.teragrid.org
In-Reply-To: <472E8C08.308@mcs.anl.gov>
References: <472E8C08.308@mcs.anl.gov>
Message-ID: <472E9044.9080600@mcs.anl.gov>

Ive reported this to TG and Ti on the chance that its on the server side.
If nothing else, possibly a PBS log can pinpoint what we're doing wrong if its us or me.

The two runs below are in ~benc/swift-logs/wilde/
7:46 PM - run142
8:57 PM - run142

Ive started to add a 'comment' file to my log dirs there to note the reason, and on occasion I copy output placed in cwd to _output. Also adding find or ls output to each dir when its relevant and I remember. Im trying to automate more of this as I go.

- Mike

On 11/4/07 9:20 PM, Michael Wilde wrote:
> Im starting to see more frequent problems like this.
> Happened once last night to 3 consecutive jobs, and tonight happened
> twice, to 6 jobs.
>
> Ti, could you look in the PBS logs, possibly on the related node(s) and
> see if its looking like a problem on tg-uc or on our side?
>
> Thanks,
>
> Mike
>
> 11/3 8:05 PM - 3 failures
> Job IDs 1571647, 48, & 49
> 11/4 7:46 PM - 3 failures
> Job IDs 1572031, 33, & 34
> 11/4 8:56 - 8:57 PM
> 1572040, 42, 43
>
> All errors have the format below.
>
> Swift retries failing jobs 3 times, hence the groups of 3 above.
> > > -------- Original Message -------- > Subject: PBS JOB 1572043.tg-master.uc.teragrid.org > Date: Sun, 4 Nov 2007 20:57:11 -0600 (CST) > From: adm at tg-master.uc.teragrid.org (root) > To: wilde at tg-grid1.uc.teragrid.org > > PBS Job Id: 1572043.tg-master.uc.teragrid.org > Job Name: STDIN > Aborted by PBS Server > Job cannot be executed > See Administrator for help > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Sun Nov 4 21:41:20 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 03:41:20 +0000 (GMT) Subject: [Swift-devel] Kickstart runs on localhost are failing In-Reply-To: <1194233838.7242.8.camel@blabla.mcs.anl.gov> References: <472E73E1.9030607@mcs.anl.gov> <1194232070.6373.1.camel@blabla.mcs.anl.gov> <472E8D59.8020206@mcs.anl.gov> <1194233524.7242.3.camel@blabla.mcs.anl.gov> <1194233838.7242.8.camel@blabla.mcs.anl.gov> Message-ID: On Sun, 4 Nov 2007, Mihael Hategan wrote: > > Ah, I see. The failure occurs when dealing with kickstart which is after > > the files are added to the cache. I did get something wrong. > > One solution would be to make kickstart transfer failure warnings > instead of them being thrown as exceptions (easy). > The other would be to only add the stageout files to the cache as the > last thing in the execute2 big try block. (very slightly harder). I think warnings are preferable. -- From hategan at mcs.anl.gov Sun Nov 4 21:48:21 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 04 Nov 2007 21:48:21 -0600 Subject: [Swift-devel] Kickstart runs on localhost are failing In-Reply-To: References: <472E73E1.9030607@mcs.anl.gov> <1194232070.6373.1.camel@blabla.mcs.anl.gov> <472E8D59.8020206@mcs.anl.gov> <1194233524.7242.3.camel@blabla.mcs.anl.gov> <1194233838.7242.8.camel@blabla.mcs.anl.gov> Message-ID: <1194234501.7815.1.camel@blabla.mcs.anl.gov> On Mon, 2007-11-05 at 03:41 +0000, Ben Clifford wrote: > > On Sun, 4 Nov 2007, Mihael Hategan wrote: > > > > Ah, I see. The failure occurs when dealing with kickstart which is after > > > the files are added to the cache. I did get something wrong. > > > > One solution would be to make kickstart transfer failure warnings > > instead of them being thrown as exceptions (easy). > > > The other would be to only add the stageout files to the cache as the > > last thing in the execute2 big try block. (very slightly harder). > > I think warnings are preferable. Done (r1457). I have about 85% confidence that it will work as intended. > From bugzilla-daemon at mcs.anl.gov Sun Nov 4 21:59:17 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Sun, 4 Nov 2007 21:59:17 -0600 (CST) Subject: [Swift-devel] [Bug 36] maxwalltime specs In-Reply-To: Message-ID: <20071105035917.340D9164BC@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=36 hategan at mcs.anl.gov changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from hategan at mcs.anl.gov 2007-11-04 21:59 ------- Closing due to lack of further complaints after fix. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. 
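[Editor's note: a sketch of the option Ben preferred and that Mihael reports implementing in r1457 in the thread above: demote a failed transfer of the optional kickstart record to a warning instead of letting it fail the job. This is hypothetical Java for illustration only; it is not the r1457 change itself, and the names below are invented.]

import java.io.FileNotFoundException;
import java.util.logging.Logger;

// Hypothetical sketch of "warn, don't fail" for optional diagnostics:
// the job's real outputs have already been handled, so a missing
// kickstart record should not cause the whole job to be retried.
public class KickstartStageOutSketch {
    private static final Logger LOG = Logger.getLogger("swift.sketch");

    // Stand-in for transferring the kickstart record back from the site.
    static void fetchKickstartRecord(String jobid) throws FileNotFoundException {
        throw new FileNotFoundException(jobid + " kickstart record not found.");
    }

    public static void postProcess(String jobid) {
        try {
            fetchKickstartRecord(jobid);
        } catch (FileNotFoundException e) {
            // In this sketch the failure is only logged as a warning,
            // rather than being rethrown and failing the job.
            LOG.warning("Could not retrieve kickstart record for " + jobid + ": " + e.getMessage());
        }
        // continue: register outputs, mark the job successful, etc.
    }

    public static void main(String[] args) {
        postProcess("angle4-dgqcmmji");
    }
}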
From bugzilla-daemon at mcs.anl.gov Mon Nov 5 07:24:29 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 5 Nov 2007 07:24:29 -0600 (CST) Subject: [Swift-devel] [Bug 111] New: stage out -info and cluster logs in the same fashion as kickstart records. Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=111 Summary: stage out -info and cluster logs in the same fashion as kickstart records. Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Severity: enhancement Priority: P2 Component: General AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk CC: swift-devel at ci.uchicago.edu make staging of info, cluster and kickstart records consistent - at present, there are unnecessarily different ways of getting at them. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From benc at hawaga.org.uk Mon Nov 5 07:25:20 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 13:25:20 +0000 (GMT) Subject: [Swift-devel] Kickstart on Angle vs not? In-Reply-To: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry> References: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry> Message-ID: On Sat, 3 Nov 2007, Ian Foster wrote: > we use a different mechanbism to retrieve kickstart ouput vs. Log file > output, it seems. I'd be interested to understand them. Given that the info records have been useful at least once, its probably useful to treat those, the kickstart records and the cluster logs all in the same fashion. I put that in the bugzilla as bug 111. -- From wilde at mcs.anl.gov Mon Nov 5 07:48:44 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 07:48:44 -0600 Subject: [Swift-devel] Kickstart on Angle vs not? In-Reply-To: References: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry> Message-ID: <472F1F3C.6030004@mcs.anl.gov> On 11/5/07 7:25 AM, Ben Clifford wrote: > > On Sat, 3 Nov 2007, Ian Foster wrote: > >> we use a different mechanbism to retrieve kickstart ouput vs. Log file >> output, it seems. I'd be interested to understand them. > > Given that the info records have been useful at least once, its probably > useful to treat those, the kickstart records and the cluster logs all in > the same fashion. Agreed. Should have an option to bring all (if feasible) or some (if too large) of this info back to submit host and store in log repository. An optional ls -lR of the server tree is often helpful to add. > I put that in the bugzilla as bug 111. From benc at hawaga.org.uk Mon Nov 5 07:50:57 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 13:50:57 +0000 (GMT) Subject: [Swift-devel] Kickstart on Angle vs not? In-Reply-To: <472F1F3C.6030004@mcs.anl.gov> References: <1575686083-1194132464-cardhu_decombobulator_blackberry.rim.net-1246028922-@bxe030.bisx.prod.on.blackberry> <472F1F3C.6030004@mcs.anl.gov> Message-ID: On Mon, 5 Nov 2007, Michael Wilde wrote: > An optional ls -lR of the server tree is often helpful to add. Though touching directories on GPFS is likely to be expensive, I think - both for the job itself and for jobs simultaneouly running on other nodes. 
-- From wilde at mcs.anl.gov Mon Nov 5 09:27:24 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 09:27:24 -0600 Subject: [Swift-devel] How best to distribute named input and outut files across dirs? In-Reply-To: References: Message-ID: <472F365C.7060907@mcs.anl.gov> Whats the best way to spread output files across a directory if they are mapped, as opposed to anonymous? In awf2.swift the outputs went into a single big dir (below _concurrent) because they are neither mapped nor members of an array. In awf3.swift I switched to an array, and they were nicely (albeit verbosely ;) mapped to an array structure automatically. In awf4.swift I name the outputs, and the files are now nicely named but all reside back in the client submit directory. Now I want to make awf5, and spread named inputs and outputs across dirs. I recall suggesting a way to do this to Andrew, but didint track how he and you did it, Ben. Andrew, can you send me your latest swift code? Ben, Mihael, is the best way to do this to manually spread the inputs across a dirs, and map both the inputs and outputs using readdata? angleinput/{00 through 99}/pcNNNN.pcap angleout/{00 through 99}/ofNNNN.angle,cfNNNN.center} I need to focus on a few admin things for a bit, but any/all advice is welcome. :::::::::::::: awf2.swift :::::::::::::: type pcapfile; type angleout; type anglecenter; (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) { app { angle4 @ifile @ofile @cfile; } } pcapfile pcapfiles[]; foreach pf in pcapfiles { angleout of; anglecenter cf; (of,cf) = angle4(pf); } :::::::::::::: awf3.swift :::::::::::::: type pcapfile; type angleout; type anglecenter; (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) { app { angle4 @ifile @ofile @cfile; } } pcapfile pcapfiles[]; angleout of[]; anglecenter cf[]; foreach pf,i in pcapfiles { (of[i],cf[i]) = angle4(pf); } :::::::::::::: awf4.swift :::::::::::::: type pcapfile; type angleout; type anglecenter; (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) { app { angle4 @ifile @ofile @cfile; } } pcapfile pcapfiles[]; angleout of[] ; anglecenter cf[] ; // note i used .angle for both in current tests... foreach pf,i in pcapfiles { (of[i],cf[i]) = angle4(pf); } On 11/1/07 11:57 AM, Ben Clifford wrote: > I just modified the way that ConcurrentMapper lays out files (r1437) > > You will likely not have encountered ConcurrentMapper by name. It is used > when you do not specify a mapper for a dataset, for example for > intermediate variables. > > Previously, all files named by this mapper were given a long name in the > root directory of the submit and cache directories. > > When a large number of files were named in this fashion, for example in an > array with thousands of elements, this would result in a file for each > element and a root directory with thousands of files. > > Most immediately I encountered this problem working with Andrew Jamieson > running on TeraPort using GPFS. Many hosts attempting to access one > directory is severely unscalable on GPFS. > > The changes I have made add more structure to filenames generated by the > ConcurrentMapper: > > > 1. All files appear in a _concurrent/ subdirectory. > > > 2. Simple/marker data typed files appear directly below _concurrent, > named as before. For example: > > file outfile; > > might give a filename: > > _concurrent//outfile-3339612a-08e1-443d-bd14-2329080d2d94- > > > 3. Structures are mapped to a sub-directory, with each element being a > file in that subdirectory. 
For example, > > type footype { file left; file right; } > footype structurefile; > > might give a directory: > > _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field > > containing two files: > > _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/left > _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/right > > > 4. Array elements are placed in a subdirectory. Within that subdirectory, > the index is using to construct a further hierarchy such that there will > never be more than 50 directories/files in any one directory. For example: > > file manyfile[]; > > might give mappings like this: > > myfile[0] stored in: > _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-0 > > myfile[22] stored in: > _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-22 > > myfile[30] stored in: > _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/h5/elt-30 > > myfile[734] stored in: > _concurrent//manyfile-bcdeedee-4df7-4d21-a207-d8051da3d133--array/h9/h4/elt-734 > > To form the paths, basically something like this happens: > convert each number into base 25. discard the most significant digit. > then starting at the least significant digit and working towards > the most significant digit, make that digit into a subdirectory. > > For example, 734 in base 10 is (1) (4) (9) in base 25 > > so we form intermediate path /h9/h4/ > > Doing this means that for large arrays directory paths will grow, whilst > for small arrays will be short; and the size of the array does not need to > be known ahead of time. > > The constant '25' can easily be adjusted. Its a compiled-in constant > defined in one place at the moment, but could be made into a mapper > parameter. > From andrewj at uchicago.edu Mon Nov 5 09:55:12 2007 From: andrewj at uchicago.edu (Andrew Robert Jamieson) Date: Mon, 5 Nov 2007 09:55:12 -0600 (CST) Subject: [Swift-devel] How best to distribute named input and outut files across dirs? In-Reply-To: <472F365C.7060907@mcs.anl.gov> References: <472F365C.7060907@mcs.anl.gov> Message-ID: Hey Mike and others, I used that splitting bash script to separate the files into subdirectories. Then I used that other script you helped me with to find where I put those files. This script generated the .csv which was then read by the csv mapper. Nothing fancy. -Andrew On Mon, 5 Nov 2007, Michael Wilde wrote: > Whats the best way to spread output files across a directory if they are > mapped, as opposed to anonymous? > > In awf2.swift the outputs went into a single big dir (below _concurrent) > because they are neither mapped nor members of an array. > > In awf3.swift I switched to an array, and they were nicely (albeit verbosely > ;) mapped to an array structure automatically. > > In awf4.swift I name the outputs, and the files are now nicely named but all > reside back in the client submit directory. > > Now I want to make awf5, and spread named inputs and outputs across dirs. I > recall suggesting a way to do this to Andrew, but didint track how he and you > did it, Ben. > > Andrew, can you send me your latest swift code? > > Ben, Mihael, is the best way to do this to manually spread the inputs across > a dirs, and map both the inputs and outputs using readdata? > > angleinput/{00 through 99}/pcNNNN.pcap > > angleout/{00 through 99}/ofNNNN.angle,cfNNNN.center} > > I need to focus on a few admin things for a bit, but any/all advice is > welcome. 
> > > > :::::::::::::: > awf2.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > foreach pf in pcapfiles { > angleout of; > anglecenter cf; > (of,cf) = angle4(pf); > } > :::::::::::::: > awf3.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > angleout of[]; > anglecenter cf[]; > > foreach pf,i in pcapfiles { > (of[i],cf[i]) = angle4(pf); > } > :::::::::::::: > awf4.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > angleout of[] ; > anglecenter cf[] ; > // note i used .angle for both in current tests... > > foreach pf,i in pcapfiles { > (of[i],cf[i]) = angle4(pf); > } > > > > On 11/1/07 11:57 AM, Ben Clifford wrote: >> I just modified the way that ConcurrentMapper lays out files (r1437) >> >> You will likely not have encountered ConcurrentMapper by name. It is used >> when you do not specify a mapper for a dataset, for example for >> intermediate variables. >> >> Previously, all files named by this mapper were given a long name in the >> root directory of the submit and cache directories. >> >> When a large number of files were named in this fashion, for example in an >> array with thousands of elements, this would result in a file for each >> element and a root directory with thousands of files. >> >> Most immediately I encountered this problem working with Andrew Jamieson >> running on TeraPort using GPFS. Many hosts attempting to access one >> directory is severely unscalable on GPFS. >> >> The changes I have made add more structure to filenames generated by the >> ConcurrentMapper: >> >> >> 1. All files appear in a _concurrent/ subdirectory. >> >> >> 2. Simple/marker data typed files appear directly below _concurrent, named >> as before. For example: >> >> file outfile; >> >> might give a filename: >> >> _concurrent//outfile-3339612a-08e1-443d-bd14-2329080d2d94- >> >> >> 3. Structures are mapped to a sub-directory, with each element being a >> file in that subdirectory. For example, >> >> type footype { file left; file right; } >> footype structurefile; >> >> might give a directory: >> >> _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field >> >> containing two files: >> >> _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/left >> _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/right >> >> >> 4. Array elements are placed in a subdirectory. Within that subdirectory, >> the index is using to construct a further hierarchy such that there will >> never be more than 50 directories/files in any one directory. 
For example: >> >> file manyfile[]; >> >> might give mappings like this: >> >> myfile[0] stored in: >> _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-0 >> >> myfile[22] stored in: >> _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-22 >> >> myfile[30] stored in: >> _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/h5/elt-30 >> >> myfile[734] stored in: >> _concurrent//manyfile-bcdeedee-4df7-4d21-a207-d8051da3d133--array/h9/h4/elt-734 >> >> To form the paths, basically something like this happens: >> convert each number into base 25. discard the most significant digit. then >> starting at the least significant digit and working towards the most >> significant digit, make that digit into a subdirectory. >> >> For example, 734 in base 10 is (1) (4) (9) in base 25 >> >> so we form intermediate path /h9/h4/ >> >> Doing this means that for large arrays directory paths will grow, whilst >> for small arrays will be short; and the size of the array does not need to >> be known ahead of time. >> >> The constant '25' can easily be adjusted. Its a compiled-in constant >> defined in one place at the moment, but could be made into a mapper >> parameter. >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From andrewj at uchicago.edu Mon Nov 5 09:58:08 2007 From: andrewj at uchicago.edu (Andrew Robert Jamieson) Date: Mon, 5 Nov 2007 09:58:08 -0600 (CST) Subject: [Swift-devel] How best to distribute named input and outut files across dirs? In-Reply-To: <472F365C.7060907@mcs.anl.gov> References: <472F365C.7060907@mcs.anl.gov> Message-ID: Latest swift code should be in ~/CADGrid/Swifty/ or ~/CADGrid/Swifty/skynet-swift-runs/ On Mon, 5 Nov 2007, Michael Wilde wrote: > Whats the best way to spread output files across a directory if they are > mapped, as opposed to anonymous? > > In awf2.swift the outputs went into a single big dir (below _concurrent) > because they are neither mapped nor members of an array. > > In awf3.swift I switched to an array, and they were nicely (albeit verbosely > ;) mapped to an array structure automatically. > > In awf4.swift I name the outputs, and the files are now nicely named but all > reside back in the client submit directory. > > Now I want to make awf5, and spread named inputs and outputs across dirs. I > recall suggesting a way to do this to Andrew, but didint track how he and you > did it, Ben. > > Andrew, can you send me your latest swift code? > > Ben, Mihael, is the best way to do this to manually spread the inputs across > a dirs, and map both the inputs and outputs using readdata? > > angleinput/{00 through 99}/pcNNNN.pcap > > angleout/{00 through 99}/ofNNNN.angle,cfNNNN.center} > > I need to focus on a few admin things for a bit, but any/all advice is > welcome. 
> > > > :::::::::::::: > awf2.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > foreach pf in pcapfiles { > angleout of; > anglecenter cf; > (of,cf) = angle4(pf); > } > :::::::::::::: > awf3.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > angleout of[]; > anglecenter cf[]; > > foreach pf,i in pcapfiles { > (of[i],cf[i]) = angle4(pf); > } > :::::::::::::: > awf4.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > angleout of[] ; > anglecenter cf[] ; > // note i used .angle for both in current tests... > > foreach pf,i in pcapfiles { > (of[i],cf[i]) = angle4(pf); > } > > > > On 11/1/07 11:57 AM, Ben Clifford wrote: >> I just modified the way that ConcurrentMapper lays out files (r1437) >> >> You will likely not have encountered ConcurrentMapper by name. It is used >> when you do not specify a mapper for a dataset, for example for >> intermediate variables. >> >> Previously, all files named by this mapper were given a long name in the >> root directory of the submit and cache directories. >> >> When a large number of files were named in this fashion, for example in an >> array with thousands of elements, this would result in a file for each >> element and a root directory with thousands of files. >> >> Most immediately I encountered this problem working with Andrew Jamieson >> running on TeraPort using GPFS. Many hosts attempting to access one >> directory is severely unscalable on GPFS. >> >> The changes I have made add more structure to filenames generated by the >> ConcurrentMapper: >> >> >> 1. All files appear in a _concurrent/ subdirectory. >> >> >> 2. Simple/marker data typed files appear directly below _concurrent, named >> as before. For example: >> >> file outfile; >> >> might give a filename: >> >> _concurrent//outfile-3339612a-08e1-443d-bd14-2329080d2d94- >> >> >> 3. Structures are mapped to a sub-directory, with each element being a >> file in that subdirectory. For example, >> >> type footype { file left; file right; } >> footype structurefile; >> >> might give a directory: >> >> _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field >> >> containing two files: >> >> _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/left >> _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/right >> >> >> 4. Array elements are placed in a subdirectory. Within that subdirectory, >> the index is using to construct a further hierarchy such that there will >> never be more than 50 directories/files in any one directory. 
For example: >> >> file manyfile[]; >> >> might give mappings like this: >> >> myfile[0] stored in: >> _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-0 >> >> myfile[22] stored in: >> _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-22 >> >> myfile[30] stored in: >> _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/h5/elt-30 >> >> myfile[734] stored in: >> _concurrent//manyfile-bcdeedee-4df7-4d21-a207-d8051da3d133--array/h9/h4/elt-734 >> >> To form the paths, basically something like this happens: >> convert each number into base 25. discard the most significant digit. then >> starting at the least significant digit and working towards the most >> significant digit, make that digit into a subdirectory. >> >> For example, 734 in base 10 is (1) (4) (9) in base 25 >> >> so we form intermediate path /h9/h4/ >> >> Doing this means that for large arrays directory paths will grow, whilst >> for small arrays will be short; and the size of the array does not need to >> be known ahead of time. >> >> The constant '25' can easily be adjusted. Its a compiled-in constant >> defined in one place at the moment, but could be made into a mapper >> parameter. >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Nov 5 10:54:03 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 16:54:03 +0000 (GMT) Subject: [Swift-devel] Re: How best to distribute named input and outut files across dirs? In-Reply-To: <472F365C.7060907@mcs.anl.gov> References: <472F365C.7060907@mcs.anl.gov> Message-ID: I'm confused by your use of the concurrent mapper with the word 'output' - anything appearing under _concurrent is rather arbitrarily named. For inputs, how are you specifying input mapping at the moment? Can you give the mapper declaration you use for inputs? For outputs, some ideas: i) explicitly map output paths using the CSV mapper or execution mapper. ii) write a custom mapper or have one of us do it that has more hierarchical behaviour. On Mon, 5 Nov 2007, Michael Wilde wrote: > Whats the best way to spread output files across a directory if they are > mapped, as opposed to anonymous? > > In awf2.swift the outputs went into a single big dir (below _concurrent) > because they are neither mapped nor members of an array. > > In awf3.swift I switched to an array, and they were nicely (albeit verbosely > ;) mapped to an array structure automatically. > > In awf4.swift I name the outputs, and the files are now nicely named but all > reside back in the client submit directory. > > Now I want to make awf5, and spread named inputs and outputs across dirs. I > recall suggesting a way to do this to Andrew, but didint track how he and you > did it, Ben. > > Andrew, can you send me your latest swift code? > > Ben, Mihael, is the best way to do this to manually spread the inputs across a > dirs, and map both the inputs and outputs using readdata? > > angleinput/{00 through 99}/pcNNNN.pcap > > angleout/{00 through 99}/ofNNNN.angle,cfNNNN.center} > > I need to focus on a few admin things for a bit, but any/all advice is > welcome. 
> > > > :::::::::::::: > awf2.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > foreach pf in pcapfiles { > angleout of; > anglecenter cf; > (of,cf) = angle4(pf); > } > :::::::::::::: > awf3.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > angleout of[]; > anglecenter cf[]; > > foreach pf,i in pcapfiles { > (of[i],cf[i]) = angle4(pf); > } > :::::::::::::: > awf4.swift > :::::::::::::: > type pcapfile; > type angleout; > type anglecenter; > > (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) > { > app { angle4 @ifile @ofile @cfile; } > } > > pcapfile pcapfiles[]; > > angleout of[] ; > anglecenter cf[] ; > // note i used .angle for both in current tests... > > foreach pf,i in pcapfiles { > (of[i],cf[i]) = angle4(pf); > } > > > > On 11/1/07 11:57 AM, Ben Clifford wrote: > > I just modified the way that ConcurrentMapper lays out files (r1437) > > > > You will likely not have encountered ConcurrentMapper by name. It is used > > when you do not specify a mapper for a dataset, for example for intermediate > > variables. > > > > Previously, all files named by this mapper were given a long name in the > > root directory of the submit and cache directories. > > > > When a large number of files were named in this fashion, for example in an > > array with thousands of elements, this would result in a file for each > > element and a root directory with thousands of files. > > > > Most immediately I encountered this problem working with Andrew Jamieson > > running on TeraPort using GPFS. Many hosts attempting to access one > > directory is severely unscalable on GPFS. > > > > The changes I have made add more structure to filenames generated by the > > ConcurrentMapper: > > > > > > 1. All files appear in a _concurrent/ subdirectory. > > > > > > 2. Simple/marker data typed files appear directly below _concurrent, named > > as before. For example: > > > > file outfile; > > > > might give a filename: > > > > _concurrent//outfile-3339612a-08e1-443d-bd14-2329080d2d94- > > > > > > 3. Structures are mapped to a sub-directory, with each element being a file > > in that subdirectory. For example, > > > > type footype { file left; file right; } > > footype structurefile; > > > > might give a directory: > > > > _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field > > > > containing two files: > > > > _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/left > > _concurrent//structurefile-c68b99dc-de3c-4288-822f-2ab3d4dc6427--field/right > > > > > > 4. Array elements are placed in a subdirectory. Within that subdirectory, > > the index is using to construct a further hierarchy such that there will > > never be more than 50 directories/files in any one directory. 
For example: > > > > file manyfile[]; > > > > might give mappings like this: > > > > myfile[0] stored in: > > _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-0 > > > > myfile[22] stored in: > > _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/elt-22 > > > > myfile[30] stored in: > > _concurrent//manyfile-0b91d809-37f5-46da-91c8-6c4a9157b06b--array/h5/elt-30 > > > > myfile[734] stored in: > > _concurrent//manyfile-bcdeedee-4df7-4d21-a207-d8051da3d133--array/h9/h4/elt-734 > > > > To form the paths, basically something like this happens: > > convert each number into base 25. discard the most significant digit. then > > starting at the least significant digit and working towards the most > > significant digit, make that digit into a subdirectory. > > > > For example, 734 in base 10 is (1) (4) (9) in base 25 > > > > so we form intermediate path /h9/h4/ > > > > Doing this means that for large arrays directory paths will grow, whilst for > > small arrays will be short; and the size of the array does not need to be > > known ahead of time. > > > > The constant '25' can easily be adjusted. Its a compiled-in constant defined > > in one place at the moment, but could be made into a mapper parameter. > > > > From benc at hawaga.org.uk Mon Nov 5 11:06:07 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 5 Nov 2007 17:06:07 +0000 (GMT) Subject: [Swift-devel] Re: How best to distribute named input and outut files across dirs? In-Reply-To: <472F365C.7060907@mcs.anl.gov> References: <472F365C.7060907@mcs.anl.gov> Message-ID: On Mon, 5 Nov 2007, Michael Wilde wrote: > In awf3.swift I switched to an array, and they were nicely (albeit verbosely > ;) mapped to an array structure automatically. > > In awf4.swift I name the outputs, and the files are now nicely named but all > reside back in the client submit directory. > > Now I want to make awf5, and spread named inputs and outputs across dirs. I > recall suggesting a way to do this to Andrew, but didint track how he and you > did it, Ben. If you're explicitly naming ouputs in awf4, you can explicitly name them in awf5 too. Put '/' symbols in the filenames to indicate directory cuts, like in URIs or filenames. -- From wilde at mcs.anl.gov Mon Nov 5 18:46:00 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 18:46:00 -0600 Subject: [Swift-devel] slow swift startup time Message-ID: <472FB948.8010605@mcs.anl.gov> Im running on tg-login.uc.teragrid.org (IA64) and seeing very long swift start times. My swift command wrapper prints the wf start and end times with the swift stdout sandwiched in between. Here's an example of those, followed by the swift log file. In this run, i start swift in the background, then tail the stdout file. it was about 70 seconds (on my watch) before swift responded with its initial messages on stdout. (I dont think its being buffered, but thats worth checking...) Note that swift was launched at 18:30:49 and its logfile entry with the runid came at 18:32:05. 32:05-30:49 = 76 seconds! This was swift 1456 compiled on terminable (or login, i forget). Suspicious: when I was running a version compiled in tg-login under Java 1.4 I would get an error message from a Java method trying to lock the log file. Not sure if this logging action (which now does not give a message) is related to this slow start time. 
- Mike UC64$ cat swift.out Swift Script local-noks.swift starting at Mon Nov 5 18:30:49 CST 2007 Swift v0.3-dev r1456 RunID: 20071105-1831-d7t5l2n3 angle4 started angle4 started angle4 started angle4 started angle4 started angle4 completed angle4 completed angle4 completed angle4 completed angle4 completed Swift Script local-noks.swift ended at Mon Nov 5 18:32:34 CST 2007 with exit code 0 UC64$ head awf*.log 2007-11-05 18:31:09,566-0600 INFO Loader awf6.swift: source file is new. Recompiling. 2007-11-05 18:31:30,454-0600 INFO Karajan Validation of XML intermediate file was successful 2007-11-05 18:31:55,465-0600 INFO unknown Using sites file: ./sites.xml 2007-11-05 18:31:55,466-0600 INFO unknown Using tc.data: ./tc.data 2007-11-05 18:32:05,518-0600 INFO unknown Swift v0.3-dev r1456 2007-11-05 18:32:05,520-0600 INFO unknown RunID: 20071105-1831-d7t5l2n3 2007-11-05 18:32:06,701-0600 INFO AbstractDataNode Found data pcapfiles.$[]/1.[0] 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data pcapfiles.$[]/2.[1] 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data pcapfiles.$[]/3.[2] UC64$ From wilde at mcs.anl.gov Mon Nov 5 18:56:45 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 18:56:45 -0600 Subject: [Swift-devel] slow swift startup time In-Reply-To: <472FB948.8010605@mcs.anl.gov> References: <472FB948.8010605@mcs.anl.gov> Message-ID: <472FBBCD.3020908@mcs.anl.gov> The lock error I was referring to is this, on stdout/err: Failed to acquire exclusive lock on log file. Below is the log file text that accompanied it. Note that Im not complaining about this error - it went away when I started compiling on terminable again. Im just pointing it out as a suspect in the slow startup. It surprised me that we bother to lock the logfile, unless Java is gratuituously doig it for us. - Mike Logfile head showed: 2007-11-04 10:15:28,949-0600 INFO Loader awf3.swift: source file is new. Recompiling. 
2007-11-04 10:15:31,315-0600 INFO Karajan Validation of XML intermediate file was successful 2007-11-04 10:15:34,597-0600 INFO unknown Using sites file: ./sites.xml 2007-11-04 10:15:34,599-0600 INFO unknown Using tc.data: ./tc.data 2007-11-04 10:15:36,606-0600 INFO unknown Swift v0.3-dev libexec/svn-revision: line 1: svn: command not found libexec/svn-revision: line 1: svn: command not found 2007-11-04 10:15:36,607-0600 INFO unknown RunID: 20071104-1015-afgc18i3 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data pcapfiles.$[]/1.[0] 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data pcapfiles.$[]/2.[1] 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data pcapfiles.$[]/3.[2] 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data pcapfiles.$[]/4.[3] 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data pcapfiles.$[]/5.[4] 2007-11-04 10:16:26,917-0600 INFO FlushableLockedFileWriter Could not acquire lock on /home/wilde/angle/data/./awf3-20071104-1015 -afgc18i3.0.rlog java.io.IOException: No locks available at sun.nio.ch.FileChannelImpl.lock0(Native Method) at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:804) at java.nio.channels.FileChannel.tryLock(FileChannel.java:983) at org.globus.cog.karajan.workflow.nodes.restartLog.FlushableLockedFileWriter.(FlushableLockedFileWriter.java:26) at org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.create(RestartLog.java:123) at org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.partialArgumentsEvaluated(RestartLog.java:55) at org.globus.cog.karajan.workflow.nodes.PartialArgumentsContainer.executeChildren(PartialArgumentsContainer.java:51) at org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled Code)) at org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled Code)) at org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled Code)) at org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled Code)) at org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined Compiled Code)) at org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled Code)) 2007-11-04 10:16:26,921-0600 WARN RestartLog Failed to acquire exclusive lock on log file. 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-4 tr=angle4 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-1 tr=angle4 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-5 tr=angle4 2007-11-04 10:16:27,000-0600 INFO vdl:execute START thread=0-3 tr=angle4 2007-11-04 10:16:27,001-0600 INFO vdl:execute START thread=0-2 tr=angle4 On 11/5/07 6:46 PM, Michael Wilde wrote: > Im running on tg-login.uc.teragrid.org (IA64) and seeing very long swift > start times. > > My swift command wrapper prints the wf start and end times with the > swift stdout sandwiched in between. Here's an example of those, > followed by the swift log file. In this run, i start swift in the > background, then tail the stdout file. it was about 70 seconds (on my > watch) before swift responded with its initial messages on stdout. (I > dont think its being buffered, but thats worth checking...) 
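One quick way to see where those 70-odd seconds go is to diff the timestamps the log already carries. A throwaway helper along these lines (hypothetical, not part of Swift) prints any gap longer than a second together with the line that follows it:

// Print gaps between consecutive log4j-style timestamps such as
// "2007-11-05 18:31:09,566-0600 INFO ...", to show which startup phase is slow.
import java.io.BufferedReader;
import java.io.FileReader;
import java.text.SimpleDateFormat;
import java.util.Date;

public class LogGaps {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        Date previous = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.length() < 23) continue;          // too short to hold a timestamp
            Date current;
            try {
                current = fmt.parse(line.substring(0, 23));
            } catch (java.text.ParseException e) {
                continue;                              // continuation line, no timestamp
            }
            if (previous != null) {
                long gap = current.getTime() - previous.getTime();
                if (gap > 1000) System.out.println((gap / 1000.0) + "s before: " + line);
            }
            previous = current;
        }
        in.close();
    }
}

Against the awf*.log head quoted in this thread it would flag roughly 21 seconds before the XML validation line and 25 seconds before the sites-file line; the time between launching the JVM and the first log line it cannot see, of course.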
> > Note that swift was launched at 18:30:49 and its logfile entry with the > runid came at 18:32:05. 32:05-30:49 = 76 seconds! > > This was swift 1456 compiled on terminable (or login, i forget). > > Suspicious: when I was running a version compiled in tg-login under Java > 1.4 I would get an error message from a Java method trying to lock the > log file. Not sure if this logging action (which now does not give a > message) is related to this slow start time. > > - Mike > > UC64$ cat swift.out > Swift Script local-noks.swift starting at Mon Nov 5 18:30:49 CST 2007 > > Swift v0.3-dev r1456 > > RunID: 20071105-1831-d7t5l2n3 > angle4 started > angle4 started > angle4 started > angle4 started > angle4 started > angle4 completed > angle4 completed > angle4 completed > angle4 completed > angle4 completed > > Swift Script local-noks.swift ended at Mon Nov 5 18:32:34 CST 2007 with > exit code 0 > > > UC64$ head awf*.log > 2007-11-05 18:31:09,566-0600 INFO Loader awf6.swift: source file is > new. Recompiling. > 2007-11-05 18:31:30,454-0600 INFO Karajan Validation of XML > intermediate file was successful > 2007-11-05 18:31:55,465-0600 INFO unknown Using sites file: ./sites.xml > 2007-11-05 18:31:55,466-0600 INFO unknown Using tc.data: ./tc.data > 2007-11-05 18:32:05,518-0600 INFO unknown Swift v0.3-dev r1456 > > 2007-11-05 18:32:05,520-0600 INFO unknown RunID: 20071105-1831-d7t5l2n3 > 2007-11-05 18:32:06,701-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/1.[0] > 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/2.[1] > 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/3.[2] > UC64$ > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From wilde at mcs.anl.gov Mon Nov 5 19:19:26 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 19:19:26 -0600 Subject: [Swift-devel] filesys_mapper doesnt take structured filenames Message-ID: <472FC11E.2090300@mcs.anl.gov> Ben, did you say that this mapper invocation *should* take directories? It doesnt seem to: pcapfile pcapfiles[]; The full code is below. The program exits without finding anything. The dir input/ is in my cwd when running swift and contains pc1.pcap thru pc5.pcap. - Mike UC64$ cat awf6.swift type pcapfile; type angleout; type anglecenter; (angleout ofile, anglecenter cfile) angle4 (pcapfile ifile) { app { angle4 @ifile @ofile @cfile; } } pcapfile pcapfiles[]; angleout of[] ; anglecenter cf[] ; foreach pf,i in pcapfiles { (of[i],cf[i]) = angle4(pf); } UC64$ -- UC64$ pwd /home/wilde/angle/data UC64$ ls input pc1.pcap pc2.pcap pc3.pcap pc4.pcap pc5.pcap UC64$ From hategan at mcs.anl.gov Mon Nov 5 20:45:42 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 05 Nov 2007 20:45:42 -0600 Subject: [Swift-devel] slow swift startup time In-Reply-To: <472FBBCD.3020908@mcs.anl.gov> References: <472FB948.8010605@mcs.anl.gov> <472FBBCD.3020908@mcs.anl.gov> Message-ID: <1194317143.8333.3.camel@blabla.mcs.anl.gov> It's a warning. warning != error. On Mon, 2007-11-05 at 18:56 -0600, Michael Wilde wrote: > The lock error I was referring to is this, on stdout/err: > > Failed to acquire exclusive lock on log file. > > Below is the log file text that accompanied it. > > Note that Im not complaining about this error - it went away when I > started compiling on terminable again. > > Im just pointing it out as a suspect in the slow startup. 
It surprised > me that we bother to lock the logfile, unless Java is gratuituously doig > it for us. > > - Mike > > Logfile head showed: > > 2007-11-04 10:15:28,949-0600 INFO Loader awf3.swift: source file is > new. Recompiling. > 2007-11-04 10:15:31,315-0600 INFO Karajan Validation of XML > intermediate file was successful > 2007-11-04 10:15:34,597-0600 INFO unknown Using sites file: ./sites.xml > 2007-11-04 10:15:34,599-0600 INFO unknown Using tc.data: ./tc.data > 2007-11-04 10:15:36,606-0600 INFO unknown Swift v0.3-dev > libexec/svn-revision: line 1: svn: command not found > libexec/svn-revision: line 1: svn: command not found > > > 2007-11-04 10:15:36,607-0600 INFO unknown RunID: 20071104-1015-afgc18i3 > 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/1.[0] > 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/2.[1] > 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/3.[2] > 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/4.[3] > 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > pcapfiles.$[]/5.[4] > 2007-11-04 10:16:26,917-0600 INFO FlushableLockedFileWriter Could not > acquire lock on /home/wilde/angle/data/./awf3-20071104-1015 > -afgc18i3.0.rlog > java.io.IOException: No locks available > at sun.nio.ch.FileChannelImpl.lock0(Native Method) > at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:804) > at java.nio.channels.FileChannel.tryLock(FileChannel.java:983) > at > org.globus.cog.karajan.workflow.nodes.restartLog.FlushableLockedFileWriter.(FlushableLockedFileWriter.java:26) > at > org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.create(RestartLog.java:123) > at > org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.partialArgumentsEvaluated(RestartLog.java:55) > at > org.globus.cog.karajan.workflow.nodes.PartialArgumentsContainer.executeChildren(PartialArgumentsContainer.java:51) > at > org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined > Compiled Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled > Code)) > at > org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined > Compiled Code)) > at > org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled > Code)) > 2007-11-04 10:16:26,921-0600 WARN RestartLog Failed to acquire > exclusive lock on log file. > 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-4 tr=angle4 > 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-1 tr=angle4 > 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-5 tr=angle4 > 2007-11-04 10:16:27,000-0600 INFO vdl:execute START thread=0-3 tr=angle4 > 2007-11-04 10:16:27,001-0600 INFO vdl:execute START thread=0-2 tr=angle4 > > > > On 11/5/07 6:46 PM, Michael Wilde wrote: > > Im running on tg-login.uc.teragrid.org (IA64) and seeing very long swift > > start times. 
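For reference, the lock acquisition in the FlushableLockedFileWriter stack trace above boils down to a single non-blocking tryLock() on the restart log, roughly as in the sketch below (my own illustration, with an invented file name, not the Swift code). tryLock() does not retry or sleep, so by itself it should not account for a long stall, but on a network filesystem the underlying lock request still goes to the file server, and on mounts without lock support it fails with the "No locks available" IOException seen in the trace:

// Illustration of an advisory lock attempt on a restart log (not Swift code).
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class RestartLogLockDemo {
    public static void main(String[] args) throws IOException {
        File logFile = new File(args.length > 0 ? args[0] : "demo.rlog");
        RandomAccessFile raf = new RandomAccessFile(logFile, "rw");
        FileChannel channel = raf.getChannel();
        FileLock lock = null;
        try {
            // returns a FileLock, or null if another process already holds one,
            // or throws IOException if the filesystem cannot do locking at all
            lock = channel.tryLock();
            if (lock == null) System.err.println("Warning: log already locked elsewhere, continuing without lock");
        } catch (IOException e) {
            // the case in the quoted log: writing proceeds, only exclusivity is lost
            System.err.println("Warning: could not acquire lock: " + e.getMessage());
        }
        raf.writeBytes("restart log entry\n");
        if (lock != null) lock.release();
        raf.close();
    }
}

Timing that call directly (or, as suggested below, running from a local filesystem) would settle whether the lock attempt contributes anything to the startup gap.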
> > > > My swift command wrapper prints the wf start and end times with the > > swift stdout sandwiched in between. Here's an example of those, > > followed by the swift log file. In this run, i start swift in the > > background, then tail the stdout file. it was about 70 seconds (on my > > watch) before swift responded with its initial messages on stdout. (I > > dont think its being buffered, but thats worth checking...) > > > > Note that swift was launched at 18:30:49 and its logfile entry with the > > runid came at 18:32:05. 32:05-30:49 = 76 seconds! > > > > This was swift 1456 compiled on terminable (or login, i forget). > > > > Suspicious: when I was running a version compiled in tg-login under Java > > 1.4 I would get an error message from a Java method trying to lock the > > log file. Not sure if this logging action (which now does not give a > > message) is related to this slow start time. > > > > - Mike > > > > UC64$ cat swift.out > > Swift Script local-noks.swift starting at Mon Nov 5 18:30:49 CST 2007 > > > > Swift v0.3-dev r1456 > > > > RunID: 20071105-1831-d7t5l2n3 > > angle4 started > > angle4 started > > angle4 started > > angle4 started > > angle4 started > > angle4 completed > > angle4 completed > > angle4 completed > > angle4 completed > > angle4 completed > > > > Swift Script local-noks.swift ended at Mon Nov 5 18:32:34 CST 2007 with > > exit code 0 > > > > > > UC64$ head awf*.log > > 2007-11-05 18:31:09,566-0600 INFO Loader awf6.swift: source file is > > new. Recompiling. > > 2007-11-05 18:31:30,454-0600 INFO Karajan Validation of XML > > intermediate file was successful > > 2007-11-05 18:31:55,465-0600 INFO unknown Using sites file: ./sites.xml > > 2007-11-05 18:31:55,466-0600 INFO unknown Using tc.data: ./tc.data > > 2007-11-05 18:32:05,518-0600 INFO unknown Swift v0.3-dev r1456 > > > > 2007-11-05 18:32:05,520-0600 INFO unknown RunID: 20071105-1831-d7t5l2n3 > > 2007-11-05 18:32:06,701-0600 INFO AbstractDataNode Found data > > pcapfiles.$[]/1.[0] > > 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data > > pcapfiles.$[]/2.[1] > > 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data > > pcapfiles.$[]/3.[2] > > UC64$ > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Nov 5 20:45:29 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 6 Nov 2007 02:45:29 +0000 (GMT) Subject: [Swift-devel] filesys_mapper doesnt take structured filenames In-Reply-To: <472FC11E.2090300@mcs.anl.gov> References: <472FC11E.2090300@mcs.anl.gov> Message-ID: On Mon, 5 Nov 2007, Michael Wilde wrote: > Ben, did you say that this mapper invocation *should* take directories? no. -- From wilde at mcs.anl.gov Mon Nov 5 21:43:51 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 21:43:51 -0600 Subject: [Swift-devel] Re: high load on tg-grid1 In-Reply-To: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> References: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> Message-ID: <472FE2F7.10407@mcs.anl.gov> Joe, I started a workflow with 1000 jobs - most likely thats what caused this. I need to check the throttles on this workflow - its possible they were open too wide. 
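Since the throttles keep coming up in this thread: the essential idea behind the job-submission throttle is just bounded concurrency, along the lines of the sketch below. This is only an illustration of the concept, not Swift's scheduler code; with the throttle off, every ready job presumably gets its own GRAM submission at once, which is the kind of thing that drives the gatekeeper load up:

// Illustration of a bounded-concurrency submitter (not Swift's actual throttle).
import java.util.concurrent.Semaphore;

public class ThrottledSubmitter {
    private final Semaphore slots;

    public ThrottledSubmitter(int maxSimultaneousJobs) {
        this.slots = new Semaphore(maxSimultaneousJobs);
    }

    // blocks until a slot is free, then runs the submission in its own thread
    public void submit(final Runnable gramSubmission) throws InterruptedException {
        slots.acquire();
        new Thread(new Runnable() {
            public void run() {
                try {
                    gramSubmission.run();   // stand-in for the GRAM submit plus job lifetime
                } finally {
                    slots.release();        // freeing the slot lets the next job go
                }
            }
        }).start();
    }

    public static void main(String[] args) throws InterruptedException {
        ThrottledSubmitter submitter = new ThrottledSubmitter(4);  // at most 4 in flight
        for (int i = 0; i < 20; i++) {
            final int job = i;
            submitter.submit(new Runnable() {
                public void run() {
                    System.out.println("job " + job + " running");
                    try { Thread.sleep(500); } catch (InterruptedException e) { }
                }
            });
        }
    }
}

With no limit the loop would start all twenty at once; with a limit of four it never has more than four in flight, which is roughly the behaviour the default throttle settings are meant to give.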
Another possibility - not sure if this was cause or effect - was that I got hundreds of messages from PBS (job aborted messages) of the form that I reported to help at tg yesterday. Im about to investigate the logs, but all my jobs are out of the queue now, and the workflow has completed. (Ben: I'll be filing the log momentarily after I do an initial check of it. Of 1000 jobs I got about 533 result datasets returned. This was w/o clustering). I got 396 emails from PBS. - Mike (Ti: responding to tg-support as thats where Joe sent this...) On 11/5/07 9:15 PM, joseph insley wrote: > I'm not sure what was causing this, but the load on tg-grid1 spiked at > over 200 a short while ago. It's coming back down now, but while it was > high I tried to submit a job through GRAM (pre-WS) and after a long wait > I got the error "GRAM Job submission failed because an I/O operation > failed (error code 3)" > > At the time there were a number of globus-job-manager processes > belonging to Mike Wilde, but only on the order of ~30something.. it > doesn't seem like this should cause such a high load, so I don't know > what was up... > > joe. > > =================================================== > joseph a. insley > insley at mcs.anl.gov > mathematics & computer science division (630) 252-5649 > argonne national laboratory (630) 252-5986 > (fax) > > > From wilde at mcs.anl.gov Mon Nov 5 22:01:08 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 22:01:08 -0600 Subject: [Swift-devel] slow swift startup time In-Reply-To: <1194317143.8333.3.camel@blabla.mcs.anl.gov> References: <472FB948.8010605@mcs.anl.gov> <472FBBCD.3020908@mcs.anl.gov> <1194317143.8333.3.camel@blabla.mcs.anl.gov> Message-ID: <472FE704.7080605@mcs.anl.gov> Like I said when I sent it - I wasnt complaining about it as an error; I was wondering if the *fact* that its requesting a lock could be causing a delay. (Eg, trying to lock, sleeping, failing, and then continuing). I was just trying to look for an explanation of why startup is slow. - Mike On 11/5/07 8:45 PM, Mihael Hategan wrote: > It's a warning. warning != error. > > On Mon, 2007-11-05 at 18:56 -0600, Michael Wilde wrote: >> The lock error I was referring to is this, on stdout/err: >> >> Failed to acquire exclusive lock on log file. >> >> Below is the log file text that accompanied it. >> >> Note that Im not complaining about this error - it went away when I >> started compiling on terminable again. >> >> Im just pointing it out as a suspect in the slow startup. It surprised >> me that we bother to lock the logfile, unless Java is gratuituously doig >> it for us. >> >> - Mike >> >> Logfile head showed: >> >> 2007-11-04 10:15:28,949-0600 INFO Loader awf3.swift: source file is >> new. Recompiling. 
>> 2007-11-04 10:15:31,315-0600 INFO Karajan Validation of XML >> intermediate file was successful >> 2007-11-04 10:15:34,597-0600 INFO unknown Using sites file: ./sites.xml >> 2007-11-04 10:15:34,599-0600 INFO unknown Using tc.data: ./tc.data >> 2007-11-04 10:15:36,606-0600 INFO unknown Swift v0.3-dev >> libexec/svn-revision: line 1: svn: command not found >> libexec/svn-revision: line 1: svn: command not found >> >> >> 2007-11-04 10:15:36,607-0600 INFO unknown RunID: 20071104-1015-afgc18i3 >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data >> pcapfiles.$[]/1.[0] >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data >> pcapfiles.$[]/2.[1] >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data >> pcapfiles.$[]/3.[2] >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data >> pcapfiles.$[]/4.[3] >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data >> pcapfiles.$[]/5.[4] >> 2007-11-04 10:16:26,917-0600 INFO FlushableLockedFileWriter Could not >> acquire lock on /home/wilde/angle/data/./awf3-20071104-1015 >> -afgc18i3.0.rlog >> java.io.IOException: No locks available >> at sun.nio.ch.FileChannelImpl.lock0(Native Method) >> at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:804) >> at java.nio.channels.FileChannel.tryLock(FileChannel.java:983) >> at >> org.globus.cog.karajan.workflow.nodes.restartLog.FlushableLockedFileWriter.(FlushableLockedFileWriter.java:26) >> at >> org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.create(RestartLog.java:123) >> at >> org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.partialArgumentsEvaluated(RestartLog.java:55) >> at >> org.globus.cog.karajan.workflow.nodes.PartialArgumentsContainer.executeChildren(PartialArgumentsContainer.java:51) >> at >> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined >> Compiled Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled >> Code)) >> at >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined >> Compiled Code)) >> at >> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled >> Code)) >> 2007-11-04 10:16:26,921-0600 WARN RestartLog Failed to acquire >> exclusive lock on log file. >> 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-4 tr=angle4 >> 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-1 tr=angle4 >> 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-5 tr=angle4 >> 2007-11-04 10:16:27,000-0600 INFO vdl:execute START thread=0-3 tr=angle4 >> 2007-11-04 10:16:27,001-0600 INFO vdl:execute START thread=0-2 tr=angle4 >> >> >> >> On 11/5/07 6:46 PM, Michael Wilde wrote: >>> Im running on tg-login.uc.teragrid.org (IA64) and seeing very long swift >>> start times. >>> >>> My swift command wrapper prints the wf start and end times with the >>> swift stdout sandwiched in between. Here's an example of those, >>> followed by the swift log file. 
In this run, i start swift in the >>> background, then tail the stdout file. it was about 70 seconds (on my >>> watch) before swift responded with its initial messages on stdout. (I >>> dont think its being buffered, but thats worth checking...) >>> >>> Note that swift was launched at 18:30:49 and its logfile entry with the >>> runid came at 18:32:05. 32:05-30:49 = 76 seconds! >>> >>> This was swift 1456 compiled on terminable (or login, i forget). >>> >>> Suspicious: when I was running a version compiled in tg-login under Java >>> 1.4 I would get an error message from a Java method trying to lock the >>> log file. Not sure if this logging action (which now does not give a >>> message) is related to this slow start time. >>> >>> - Mike >>> >>> UC64$ cat swift.out >>> Swift Script local-noks.swift starting at Mon Nov 5 18:30:49 CST 2007 >>> >>> Swift v0.3-dev r1456 >>> >>> RunID: 20071105-1831-d7t5l2n3 >>> angle4 started >>> angle4 started >>> angle4 started >>> angle4 started >>> angle4 started >>> angle4 completed >>> angle4 completed >>> angle4 completed >>> angle4 completed >>> angle4 completed >>> >>> Swift Script local-noks.swift ended at Mon Nov 5 18:32:34 CST 2007 with >>> exit code 0 >>> >>> >>> UC64$ head awf*.log >>> 2007-11-05 18:31:09,566-0600 INFO Loader awf6.swift: source file is >>> new. Recompiling. >>> 2007-11-05 18:31:30,454-0600 INFO Karajan Validation of XML >>> intermediate file was successful >>> 2007-11-05 18:31:55,465-0600 INFO unknown Using sites file: ./sites.xml >>> 2007-11-05 18:31:55,466-0600 INFO unknown Using tc.data: ./tc.data >>> 2007-11-05 18:32:05,518-0600 INFO unknown Swift v0.3-dev r1456 >>> >>> 2007-11-05 18:32:05,520-0600 INFO unknown RunID: 20071105-1831-d7t5l2n3 >>> 2007-11-05 18:32:06,701-0600 INFO AbstractDataNode Found data >>> pcapfiles.$[]/1.[0] >>> 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data >>> pcapfiles.$[]/2.[1] >>> 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data >>> pcapfiles.$[]/3.[2] >>> UC64$ >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > From hategan at mcs.anl.gov Mon Nov 5 22:03:44 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 05 Nov 2007 22:03:44 -0600 Subject: [Swift-devel] slow swift startup time In-Reply-To: <472FE704.7080605@mcs.anl.gov> References: <472FB948.8010605@mcs.anl.gov> <472FBBCD.3020908@mcs.anl.gov> <1194317143.8333.3.camel@blabla.mcs.anl.gov> <472FE704.7080605@mcs.anl.gov> Message-ID: <1194321824.11190.1.camel@blabla.mcs.anl.gov> On Mon, 2007-11-05 at 22:01 -0600, Michael Wilde wrote: > Like I said when I sent it - I wasnt complaining about it as an error; I > was wondering if the *fact* that its requesting a lock could be causing > a delay. (Eg, trying to lock, sleeping, failing, and then continuing). I guess it depends on the fs. If you want to eliminate the possibility, run it from a local fs. > > I was just trying to look for an explanation of why startup is slow. > > - Mike > > > On 11/5/07 8:45 PM, Mihael Hategan wrote: > > It's a warning. warning != error. > > > > On Mon, 2007-11-05 at 18:56 -0600, Michael Wilde wrote: > >> The lock error I was referring to is this, on stdout/err: > >> > >> Failed to acquire exclusive lock on log file. 
> >> > >> Below is the log file text that accompanied it. > >> > >> Note that Im not complaining about this error - it went away when I > >> started compiling on terminable again. > >> > >> Im just pointing it out as a suspect in the slow startup. It surprised > >> me that we bother to lock the logfile, unless Java is gratuituously doig > >> it for us. > >> > >> - Mike > >> > >> Logfile head showed: > >> > >> 2007-11-04 10:15:28,949-0600 INFO Loader awf3.swift: source file is > >> new. Recompiling. > >> 2007-11-04 10:15:31,315-0600 INFO Karajan Validation of XML > >> intermediate file was successful > >> 2007-11-04 10:15:34,597-0600 INFO unknown Using sites file: ./sites.xml > >> 2007-11-04 10:15:34,599-0600 INFO unknown Using tc.data: ./tc.data > >> 2007-11-04 10:15:36,606-0600 INFO unknown Swift v0.3-dev > >> libexec/svn-revision: line 1: svn: command not found > >> libexec/svn-revision: line 1: svn: command not found > >> > >> > >> 2007-11-04 10:15:36,607-0600 INFO unknown RunID: 20071104-1015-afgc18i3 > >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > >> pcapfiles.$[]/1.[0] > >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > >> pcapfiles.$[]/2.[1] > >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > >> pcapfiles.$[]/3.[2] > >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > >> pcapfiles.$[]/4.[3] > >> 2007-11-04 10:15:36,825-0600 INFO AbstractDataNode Found data > >> pcapfiles.$[]/5.[4] > >> 2007-11-04 10:16:26,917-0600 INFO FlushableLockedFileWriter Could not > >> acquire lock on /home/wilde/angle/data/./awf3-20071104-1015 > >> -afgc18i3.0.rlog > >> java.io.IOException: No locks available > >> at sun.nio.ch.FileChannelImpl.lock0(Native Method) > >> at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:804) > >> at java.nio.channels.FileChannel.tryLock(FileChannel.java:983) > >> at > >> org.globus.cog.karajan.workflow.nodes.restartLog.FlushableLockedFileWriter.(FlushableLockedFileWriter.java:26) > >> at > >> org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.create(RestartLog.java:123) > >> at > >> org.globus.cog.karajan.workflow.nodes.restartLog.RestartLog.partialArgumentsEvaluated(RestartLog.java:55) > >> at > >> org.globus.cog.karajan.workflow.nodes.PartialArgumentsContainer.executeChildren(PartialArgumentsContainer.java:51) > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowContainer.execute(FlowContainer.java(Compiled > >> Code)) > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowNode.restart(FlowNode.java(Compiled > >> Code)) > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowNode.start(FlowNode.java(Inlined > >> Compiled Code)) > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowNode.controlEvent(FlowNode.java(Compiled > >> Code)) > >> at > >> org.globus.cog.karajan.workflow.nodes.FlowNode.event(FlowNode.java(Compiled > >> Code)) > >> at > >> org.globus.cog.karajan.workflow.FlowElementWrapper.event(FlowElementWrapper.java(Compiled > >> Code)) > >> at > >> org.globus.cog.karajan.workflow.events.EventBus.send(EventBus.java(Compiled > >> Code)) > >> at > >> org.globus.cog.karajan.workflow.events.EventBus.sendHooked(EventBus.java(Inlined > >> Compiled Code)) > >> at > >> org.globus.cog.karajan.workflow.events.EventWorker.run(EventWorker.java(Compiled > >> Code)) > >> 2007-11-04 10:16:26,921-0600 WARN RestartLog Failed to acquire > >> exclusive lock on log file. 
> >> 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-4 tr=angle4 > >> 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-1 tr=angle4 > >> 2007-11-04 10:16:26,999-0600 INFO vdl:execute START thread=0-5 tr=angle4 > >> 2007-11-04 10:16:27,000-0600 INFO vdl:execute START thread=0-3 tr=angle4 > >> 2007-11-04 10:16:27,001-0600 INFO vdl:execute START thread=0-2 tr=angle4 > >> > >> > >> > >> On 11/5/07 6:46 PM, Michael Wilde wrote: > >>> Im running on tg-login.uc.teragrid.org (IA64) and seeing very long swift > >>> start times. > >>> > >>> My swift command wrapper prints the wf start and end times with the > >>> swift stdout sandwiched in between. Here's an example of those, > >>> followed by the swift log file. In this run, i start swift in the > >>> background, then tail the stdout file. it was about 70 seconds (on my > >>> watch) before swift responded with its initial messages on stdout. (I > >>> dont think its being buffered, but thats worth checking...) > >>> > >>> Note that swift was launched at 18:30:49 and its logfile entry with the > >>> runid came at 18:32:05. 32:05-30:49 = 76 seconds! > >>> > >>> This was swift 1456 compiled on terminable (or login, i forget). > >>> > >>> Suspicious: when I was running a version compiled in tg-login under Java > >>> 1.4 I would get an error message from a Java method trying to lock the > >>> log file. Not sure if this logging action (which now does not give a > >>> message) is related to this slow start time. > >>> > >>> - Mike > >>> > >>> UC64$ cat swift.out > >>> Swift Script local-noks.swift starting at Mon Nov 5 18:30:49 CST 2007 > >>> > >>> Swift v0.3-dev r1456 > >>> > >>> RunID: 20071105-1831-d7t5l2n3 > >>> angle4 started > >>> angle4 started > >>> angle4 started > >>> angle4 started > >>> angle4 started > >>> angle4 completed > >>> angle4 completed > >>> angle4 completed > >>> angle4 completed > >>> angle4 completed > >>> > >>> Swift Script local-noks.swift ended at Mon Nov 5 18:32:34 CST 2007 with > >>> exit code 0 > >>> > >>> > >>> UC64$ head awf*.log > >>> 2007-11-05 18:31:09,566-0600 INFO Loader awf6.swift: source file is > >>> new. Recompiling. 
> >>> 2007-11-05 18:31:30,454-0600 INFO Karajan Validation of XML > >>> intermediate file was successful > >>> 2007-11-05 18:31:55,465-0600 INFO unknown Using sites file: ./sites.xml > >>> 2007-11-05 18:31:55,466-0600 INFO unknown Using tc.data: ./tc.data > >>> 2007-11-05 18:32:05,518-0600 INFO unknown Swift v0.3-dev r1456 > >>> > >>> 2007-11-05 18:32:05,520-0600 INFO unknown RunID: 20071105-1831-d7t5l2n3 > >>> 2007-11-05 18:32:06,701-0600 INFO AbstractDataNode Found data > >>> pcapfiles.$[]/1.[0] > >>> 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data > >>> pcapfiles.$[]/2.[1] > >>> 2007-11-05 18:32:06,702-0600 INFO AbstractDataNode Found data > >>> pcapfiles.$[]/3.[2] > >>> UC64$ > >>> _______________________________________________ > >>> Swift-devel mailing list > >>> Swift-devel at ci.uchicago.edu > >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >>> > >>> > >> _______________________________________________ > >> Swift-devel mailing list > >> Swift-devel at ci.uchicago.edu > >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > >> > > > > > From wilde at mcs.anl.gov Mon Nov 5 22:27:55 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 22:27:55 -0600 Subject: [Swift-devel] 1000-job angle workflow gets high failure rate In-Reply-To: <472FE2F7.10407@mcs.anl.gov> References: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> <472FE2F7.10407@mcs.anl.gov> Message-ID: <472FED4B.209@mcs.anl.gov> Was: Re: [Swift-devel] Re: high load on tg-grid1 Ben, the logs of my first 1000-job run for this week is in swift-logs/wilde/run153. This run shows a high volume (396) of the same emailed PBS error "Aborted by PBS Server" that I first saw on Saturday night. (although it turns out I now see these sporadically in my email going back to august) It produced 469 kickstart records and 1064 (out of 2000) data files. I assume the data files came in pairs, that would be 532 succeeding jobs. Its odd that 469+532=1001, but perhaps a coincidence. Im not going to take this log apart yet; first I want to rerun with clustering, and check my throttles. Possible that throttles open too wide are causing the PBS failures? Possible the same issue for the 5-wide angle run??? Also: I thought kickstart recs would get returned in a directory tree, no? Lastly, I'd like to get the input files mapped from a tree structure. Can structured_regexp_mapper do that? Ie, can I set its source to a dir rather than a swift variable? (You might have explained that, but I didnt get it in my notes). If the args to this have some powerful variations, can you fire off a note describing? Thanks, Mike On 11/5/07 9:43 PM, Michael Wilde wrote: > Joe, I started a workflow with 1000 jobs - most likely thats what caused > this. I need to check the throttles on this workflow - its possible they > were open too wide. > > Another possibility - not sure if this was cause or effect - was that I > got hundreds of messages from PBS (job aborted messages) of the form > that I reported to help at tg yesterday. > > Im about to investigate the logs, but all my jobs are out of the queue > now, and the workflow has completed. > > (Ben: I'll be filing the log momentarily after I do an initial check of > it. Of 1000 jobs I got about 533 result datasets returned. This was w/o > clustering). I got 396 emails from PBS. > > - Mike > > (Ti: responding to tg-support as thats where Joe sent this...) 
> > On 11/5/07 9:15 PM, joseph insley wrote: >> I'm not sure what was causing this, but the load on tg-grid1 spiked at >> over 200 a short while ago. It's coming back down now, but while it >> was high I tried to submit a job through GRAM (pre-WS) and after a >> long wait I got the error "GRAM Job submission failed because an I/O >> operation failed (error code 3)" >> >> At the time there were a number of globus-job-manager processes >> belonging to Mike Wilde, but only on the order of ~30something.. it >> doesn't seem like this should cause such a high load, so I don't know >> what was up... >> >> joe. >> >> =================================================== >> joseph a. insley >> insley at mcs.anl.gov >> mathematics & computer science division (630) 252-5649 >> argonne national laboratory (630) >> 252-5986 (fax) >> >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From benc at hawaga.org.uk Mon Nov 5 22:04:24 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 6 Nov 2007 04:04:24 +0000 (GMT) Subject: [Swift-devel] Re: high load on tg-grid1 In-Reply-To: <472FE2F7.10407@mcs.anl.gov> References: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> <472FE2F7.10407@mcs.anl.gov> Message-ID: Did you run this with swift default throttling? If so, I'm interested to see the swift site scores. On Mon, 5 Nov 2007, Michael Wilde wrote: > Joe, I started a workflow with 1000 jobs - most likely thats what caused this. > I need to check the throttles on this workflow - its possible they were open > too wide. > > Another possibility - not sure if this was cause or effect - was that I got > hundreds of messages from PBS (job aborted messages) of the form that I > reported to help at tg yesterday. > > Im about to investigate the logs, but all my jobs are out of the queue now, > and the workflow has completed. > > (Ben: I'll be filing the log momentarily after I do an initial check of it. Of > 1000 jobs I got about 533 result datasets returned. This was w/o clustering). > I got 396 emails from PBS. > > - Mike > > (Ti: responding to tg-support as thats where Joe sent this...) > > On 11/5/07 9:15 PM, joseph insley wrote: > > I'm not sure what was causing this, but the load on tg-grid1 spiked at over > > 200 a short while ago. It's coming back down now, but while it was high I > > tried to submit a job through GRAM (pre-WS) and after a long wait I got the > > error "GRAM Job submission failed because an I/O operation failed (error > > code 3)" > > > > At the time there were a number of globus-job-manager processes belonging to > > Mike Wilde, but only on the order of ~30something.. it doesn't seem like > > this should cause such a high load, so I don't know what was up... > > > > joe. > > > > =================================================== > > joseph a. insley > > insley at mcs.anl.gov > > mathematics & computer science division (630) 252-5649 > > argonne national laboratory (630) 252-5986 > > (fax) > > > > > > > > From wilde at mcs.anl.gov Mon Nov 5 22:38:37 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 22:38:37 -0600 Subject: [Swift-devel] Re: high load on tg-grid1 In-Reply-To: References: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> <472FE2F7.10407@mcs.anl.gov> Message-ID: <472FEFCD.6070904@mcs.anl.gov> I ran it with the options in the swift.properties file of that log dir run153. 
Ive been using these for a bit, you'll need to check there what the settings were. If you suggest new ones I'll try them now when I set up a clustered run. Any suggestion on clustering size? Also: this, unlike previous runs, is running a dummy (sleep) angle job. What should I set that simulated run time to? The real angle run time is O(60 seconds). Want it "real" or "faster"? - Mike On 11/5/07 10:04 PM, Ben Clifford wrote: > Did you run this with swift default throttling? If so, I'm interested to > see the swift site scores. > > On Mon, 5 Nov 2007, Michael Wilde wrote: > >> Joe, I started a workflow with 1000 jobs - most likely thats what caused this. >> I need to check the throttles on this workflow - its possible they were open >> too wide. >> >> Another possibility - not sure if this was cause or effect - was that I got >> hundreds of messages from PBS (job aborted messages) of the form that I >> reported to help at tg yesterday. >> >> Im about to investigate the logs, but all my jobs are out of the queue now, >> and the workflow has completed. >> >> (Ben: I'll be filing the log momentarily after I do an initial check of it. Of >> 1000 jobs I got about 533 result datasets returned. This was w/o clustering). >> I got 396 emails from PBS. >> >> - Mike >> >> (Ti: responding to tg-support as thats where Joe sent this...) >> >> On 11/5/07 9:15 PM, joseph insley wrote: >>> I'm not sure what was causing this, but the load on tg-grid1 spiked at over >>> 200 a short while ago. It's coming back down now, but while it was high I >>> tried to submit a job through GRAM (pre-WS) and after a long wait I got the >>> error "GRAM Job submission failed because an I/O operation failed (error >>> code 3)" >>> >>> At the time there were a number of globus-job-manager processes belonging to >>> Mike Wilde, but only on the order of ~30something.. it doesn't seem like >>> this should cause such a high load, so I don't know what was up... >>> >>> joe. >>> >>> =================================================== >>> joseph a. insley >>> insley at mcs.anl.gov >>> mathematics & computer science division (630) 252-5649 >>> argonne national laboratory (630) 252-5986 >>> (fax) >>> >>> >>> >> > > From wilde at mcs.anl.gov Mon Nov 5 23:02:53 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 23:02:53 -0600 Subject: [Swift-devel] Re: high load on tg-grid1 In-Reply-To: References: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> <472FE2F7.10407@mcs.anl.gov> Message-ID: <472FF57D.9040803@mcs.anl.gov> The job throttles were all set to off - thats the way I had set them for Falkon and I forgot to change them for PBS. The data throttles were set to the defaults. I'll start the next run with clustering with all throttles set to default, unless you suggest different (in time) Note that my swift.properties is a subset of the full file (I only include ones I plan to mess with). - Mike On 11/5/07 10:04 PM, Ben Clifford wrote: > Did you run this with swift default throttling? If so, I'm interested to > see the swift site scores. > > On Mon, 5 Nov 2007, Michael Wilde wrote: > >> Joe, I started a workflow with 1000 jobs - most likely thats what caused this. >> I need to check the throttles on this workflow - its possible they were open >> too wide. >> >> Another possibility - not sure if this was cause or effect - was that I got >> hundreds of messages from PBS (job aborted messages) of the form that I >> reported to help at tg yesterday. 
>> >> Im about to investigate the logs, but all my jobs are out of the queue now, >> and the workflow has completed. >> >> (Ben: I'll be filing the log momentarily after I do an initial check of it. Of >> 1000 jobs I got about 533 result datasets returned. This was w/o clustering). >> I got 396 emails from PBS. >> >> - Mike >> >> (Ti: responding to tg-support as thats where Joe sent this...) >> >> On 11/5/07 9:15 PM, joseph insley wrote: >>> I'm not sure what was causing this, but the load on tg-grid1 spiked at over >>> 200 a short while ago. It's coming back down now, but while it was high I >>> tried to submit a job through GRAM (pre-WS) and after a long wait I got the >>> error "GRAM Job submission failed because an I/O operation failed (error >>> code 3)" >>> >>> At the time there were a number of globus-job-manager processes belonging to >>> Mike Wilde, but only on the order of ~30something.. it doesn't seem like >>> this should cause such a high load, so I don't know what was up... >>> >>> joe. >>> >>> =================================================== >>> joseph a. insley >>> insley at mcs.anl.gov >>> mathematics & computer science division (630) 252-5649 >>> argonne national laboratory (630) 252-5986 >>> (fax) >>> >>> >>> >> > > From wilde at mcs.anl.gov Mon Nov 5 23:56:57 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 05 Nov 2007 23:56:57 -0600 Subject: [Swift-devel] angle-1000 second run Message-ID: <47300229.1040801@mcs.anl.gov> I just ran a second run of angle-1000, this time with clustering. I thought I had the throttles at default values but missed one. I killed the run after a few hundred data files were produced because it was running too slowly and seemed to have reached a steady state. The logs are in wilde/run154. Here;s what I noted seemed wrong with this run: 1. only 4 jobs max ran at a time (as seen by qstat over many many spot checks) 2. only ONE data file came back before I killed the run - yet hundreds were produced (as seen on the server size). Surely these should have started trickling in by now? 3. The cluster sizes were extremely small about 4 - should have been 10-20 by my calcs. 4. I still got over a dozen PBS job aborted messages -- Im going to start another run and let this one go till it finishes. I'll use totally default throttles and increase my cluster params (but I dont understand why the current values didnt work). One more note: this run is using executable script angle4.fast.sh which has a sleep 3 as its main action. It logs misc stuff to its 2 output files, but otherwise takes the same args as the real angle4.sh. Its running out of ~wilde/angle/data on tg-login1. - Mike From wilde at mcs.anl.gov Tue Nov 6 00:05:37 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 00:05:37 -0600 Subject: [Swift-devel] questions on swift properties Message-ID: <47300431.6050306@mcs.anl.gov> 2 questions: A few days ago I saw a spec for how to set GLOBUS::maxwalltime to values other than minutes. Eg 00:nn for seconds??? But I cant find that spec now. Can someone point me at it? I thought there was a new parameter to set the min wait time between submissions to GT2. I cant find that in either the etc/swift.properties sample or the userguide. Am I missing is or is it not documented? Please describe. Lastly - the wording of the throttle parameters in swift.properties confuses me even after reading them 10+ times. Im confused between max # of jobs that can be running, and the rate at which they can be submitted. 
I think these need to be reworded to clarify some confusion. Thanks. From wilde at mcs.anl.gov Tue Nov 6 00:17:19 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 00:17:19 -0600 Subject: [Swift-devel] angle-1000 3rd run Message-ID: <473006EF.6040408@mcs.anl.gov> Just started another angle-1000. I trimmed my properties file to this: -- kickstart.always.transfer=true clustering.enabled=true clustering.queue.delay=10 clustering.min.time=1200 sitedir.keep=true -- Clusters are still small so far - mostly 4. The runtime of each job is 3 secs. I set the maxwalltime to 1 (which I think is 60 seconds) until I can verify how to set this in seconds. The number of running jobs I see is still extremely low - 3 right now; was 1 and 2 for a while. The cluster is wide open - lots of free cpus, no queue. One improvement in this run: data seems to be flowing back almost from the start, unlike the previous run where almost no data result files had come back by the time I killed the wf. I'll let this one run as far as it goes, and check on it in the morning (it should push itself to swift-logs if/when it finishes). - Mike From wilde at mcs.anl.gov Tue Nov 6 06:31:36 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 06:31:36 -0600 Subject: [Swift-devel] angle-1000 3rd run Message-ID: <47305EA8.7080501@mcs.anl.gov> It stopped after producing ~360 output members because the credential expired. I'll need to check for that in my wrapper script. The logs are in swift-logs/wilde/run155. Restarting it now. From wilde at mcs.anl.gov Tue Nov 6 07:19:40 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 07:19:40 -0600 Subject: [Swift-devel] angle-1000 4th run Message-ID: <473069EC.2010408@mcs.anl.gov> is running, but slowly: Same behavior: only 2-3 jobs running in two spot checks. Cluster size is still low. I must have a math error in my time specs. Will check again. Clusters are starting every 10 secs but much smaller than expected/desired. Again, maxwalltime is 60 secs and swift.properties cluster settings are: clustering.enabled=true clustering.queue.delay=10 clustering.min.time=1200 So I would have expected 20-40 jobs per cluster (2 x (1200/60)) - mike UC64$ head swift.out Swift script awf6.swift starting at Tue Nov 6 06:32:46 CST 2007 running on sites: UC-nfs-gt2-ks Swift v0.3-dev r1456 RunID: 20071106-0632-asuk0my2 angle4 started ... 
UC64$ qstat -u wilde tg-master.uc.teragrid.org: Req'd Req'd Elap Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time -------------------- -------- -------- ---------- ------ ----- --- ------ ----- - ----- 1574316.tg-master.uc wilde dque STDIN 20590 1 -- -- 00:15 R -- 1574318.tg-master.uc wilde dque STDIN 20666 1 -- -- 00:15 R -- UC64$ date Tue Nov 6 06:58:11 CST 2007 UC64$ grep -i cluster.*size a*.log 2007-11-06 06:34:12,327-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-0-1194352386645 with size 4 2007-11-06 06:34:22,326-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-1-1194352386651 with size 2 2007-11-06 06:35:32,332-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-2-1194352386875 with size 3 2007-11-06 06:35:42,332-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-3-1194352386883 with size 2 2007-11-06 06:35:52,333-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-4-1194352386974 with size 2 2007-11-06 06:36:02,334-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-5-1194352387000 with size 4 2007-11-06 06:36:22,335-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-6-1194352387074 with size 3 2007-11-06 06:36:32,336-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-7-1194352387112 with size 4 2007-11-06 06:36:52,337-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-8-1194352387329 with size 4 2007-11-06 06:37:52,344-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-9-1194352387719 with size 3 2007-11-06 06:38:52,350-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-10-1194352387823 with size 3 2007-11-06 06:39:52,356-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-11-1194352387927 with size 3 2007-11-06 06:40:52,362-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-12-1194352388031 with size 3 2007-11-06 06:41:52,369-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-13-1194352388135 with size 3 2007-11-06 06:42:52,376-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-14-1194352388239 with size 3 2007-11-06 06:43:52,382-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-15-1194352388343 with size 3 2007-11-06 06:45:02,389-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-16-1194352388447 with size 3 2007-11-06 06:46:02,396-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-17-1194352388551 with size 3 2007-11-06 06:47:02,403-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-18-1194352388655 with size 3 2007-11-06 06:48:02,409-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-19-1194352388759 with size 3 2007-11-06 06:49:02,415-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-20-1194352388863 with size 3 2007-11-06 06:50:02,421-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-21-1194352388975 with size 4 2007-11-06 06:51:12,428-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-22-1194352389137 with size 4 2007-11-06 06:51:22,428-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-23-1194352389147 with size 4 2007-11-06 06:52:22,434-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-24-1194352389489 with size 4 2007-11-06 06:52:42,435-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-25-1194352389503 with size 4 2007-11-06 06:52:52,517-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-26-1194352389514 with size 3 2007-11-06 06:53:12,517-0600 INFO VDSAdaptiveScheduler Creating cluster 
urn:cluster-27-1194352389579 with size 4 2007-11-06 06:53:32,519-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-28-1194352389593 with size 4 2007-11-06 06:53:42,520-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-29-1194352389703 with size 4 2007-11-06 06:54:02,521-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-30-1194352389765 with size 4 2007-11-06 06:54:22,523-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-31-1194352389952 with size 4 2007-11-06 06:54:32,523-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-32-1194352390996 with size 2 2007-11-06 06:54:42,523-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-33-1194352391028 with size 2 2007-11-06 06:54:52,524-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-34-1194352391084 with size 2 2007-11-06 06:55:02,524-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-35-1194352391098 with size 4 2007-11-06 06:55:22,630-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-36-1194352391181 with size 2 2007-11-06 06:55:32,633-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-37-1194352391192 with size 3 2007-11-06 06:55:42,635-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-38-1194352391224 with size 2 2007-11-06 06:55:52,635-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-39-1194352391244 with size 2 2007-11-06 06:56:03,050-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-40-1194352391252 with size 2 2007-11-06 06:56:12,956-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-41-1194352391290 with size 4 2007-11-06 06:56:22,956-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-42-1194352391334 with size 2 2007-11-06 06:56:32,957-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-43-1194352391366 with size 2 2007-11-06 06:56:42,958-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-44-1194352391404 with size 4 2007-11-06 06:57:02,960-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-45-1194352391466 with size 4 2007-11-06 06:57:12,960-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-46-1194352391525 with size 3 2007-11-06 06:57:32,961-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-47-1194352391614 with size 4 2007-11-06 06:57:52,964-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-48-1194352391682 with size 3 2007-11-06 06:58:02,980-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-49-1194352391735 with size 3 2007-11-06 06:58:22,982-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-50-1194352391818 with size 4 2007-11-06 06:58:32,982-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-51-1194352391829 with size 3 2007-11-06 06:58:52,982-0600 INFO VDSAdaptiveScheduler Creating cluster urn:cluster-52-1194352391894 with size 4 UC64$ From benc at hawaga.org.uk Tue Nov 6 08:11:12 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 6 Nov 2007 14:11:12 +0000 (GMT) Subject: [Swift-devel] Re: high load on tg-grid1 In-Reply-To: <472FF57D.9040803@mcs.anl.gov> References: <0CB49B83-3E9A-4D55-8362-1FFE1396F2FF@mcs.anl.gov> <472FE2F7.10407@mcs.anl.gov> <472FF57D.9040803@mcs.anl.gov> Message-ID: On Mon, 5 Nov 2007, Michael Wilde wrote: > The job throttles were all set to off - thats the way I had set them for > Falkon and I forgot to change them for PBS. Oh dear. 
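On the cluster sizes in the VDSAdaptiveScheduler log above: a back-of-envelope check, assuming clustering.min.time is a per-cluster time budget in seconds and each job is charged its 60-second maxwalltime, suggests why they come out so small. This is my own arithmetic, not a statement about what the clustering code actually does:

// Back-of-envelope check of expected cluster size from the settings quoted in this thread.
public class ClusterSizeEstimate {
    public static void main(String[] args) {
        int clusterMinTime = 1200;        // clustering.min.time, assumed to be seconds
        int jobMaxWalltime = 60;          // maxwalltime charged per job, in seconds
        int queueDelay = 10;              // clustering.queue.delay, seconds between flushes

        // what the time budget alone would allow per cluster
        System.out.println("time-budget size ~ " + (clusterMinTime / jobMaxWalltime));

        // but a cluster can only contain the jobs sitting in the clustering queue
        // when the delay expires; the arrival rate here is an invented figure
        double readyJobsPerSecond = 0.5;
        System.out.println("queue-window size ~ " + (int) (readyJobsPerSecond * queueDelay));
    }
}

If that reading is right, the 10-second queue delay rather than the time budget is what caps the clusters at two to four jobs, which is consistent with the advice later in the thread to raise the delay so that more jobs can accumulate before a cluster is formed.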
> I'll start the next run with clustering with all throttles set to default,
> unless you suggest different (in time)

Run with the default values that are specified in the latest SVN swift.properties file.

--

From benc at hawaga.org.uk Tue Nov 6 08:16:11 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 6 Nov 2007 14:16:11 +0000 (GMT)
Subject: [Swift-devel] Re: angle-1000 second run
In-Reply-To: <47300229.1040801@mcs.anl.gov>
References: <47300229.1040801@mcs.anl.gov>
Message-ID: 

On Mon, 5 Nov 2007, Michael Wilde wrote:

> 1. only 4 jobs max ran at a time (as seen by qstat over many many spot checks)

We can look at scoring from the run.

> 2. only ONE data file came back before I killed the run - yet hundreds were
> produced (as seen on the server size). Surely these should have started
> trickling in by now?

Not if jobs were still staging in - there's one file transfer throttle shared between all file transfers, and stageins submitted at the start are going to get serviced before stage outs. That should be apparent from a graph if I plot it.

> 3. The cluster sizes were extremely small about 4 - should have been 10-20 by
> my calcs.

Increase the cluster queue delay parameter from 4 to about 30 (seconds). This will make Swift wait much longer before putting clusters together, which may allow more jobs to build up in the clustering queue.

Make sure that you have the cluster maximum time and maxwalltimes for jobs set to sensible values, because large clusters will highlight misconfigurations there. In particular, note that the maximum cluster time in the config file needs to be (less than) half of the maxwalltime permitted for the site you submit to (so if you are allowed to run 15 minute jobs, set the cluster maximum time to 7*60, for example).

Are you using the PBS provider or GRAM to submit?

>
> 4. I still got over a dozen PBS job aborted messages
>
> --
>
> Im going to start another run and let this one go till it finishes.
>
> I'll use totally default throttles and increase my cluster params (but I dont
> understand why the current values didnt work).
>
> One more note: this run is using executable script angle4.fast.sh which has a
> sleep 3 as its main action. It logs misc stuff to its 2 output files, but
> otherwise takes the same args as the real angle4.sh.
>
> Its running out of ~wilde/angle/data on tg-login1.
>
> - Mike
>
>
>

From benc at hawaga.org.uk Tue Nov 6 08:32:22 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Tue, 6 Nov 2007 14:32:22 +0000 (GMT)
Subject: [Swift-devel] questions on swift properties
In-Reply-To: <47300431.6050306@mcs.anl.gov>
References: <47300431.6050306@mcs.anl.gov>
Message-ID: 

On Tue, 6 Nov 2007, Michael Wilde wrote:

> A few days ago I saw a spec for how to set GLOBUS::maxwalltime to values other
> than minutes. Eg 00:nn for seconds??? But I cant find that spec now. Can
> someone point me at it?

In the user guide, in the properties section in the globus subsection, it should be there.

> I thought there was a new parameter to set the min wait time between
> submissions to GT2. I cant find that in either the etc/swift.properties
> sample or the userguide. Am I missing is or is it not documented? Please
> describe.

can't remember.

> Lastly - the wording of the throttle parameters in swift.properties confuses
> me even after reading them 10+ times. Im confused between max # of jobs that
> can be running, and the rate at which they can be submitted. I think these
> need to be reworded to clarify some confusion.

yes. That's on my to-do list.
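As a concrete illustration of the clustering advice in this thread, a swift.properties fragment might look like the following (values are examples only for a site that allows 15-minute jobs, not the shipped defaults; clustering.min.time stands in here for the "cluster maximum time" knob being discussed):

# illustrative values only
clustering.enabled=true
# seconds to let jobs accumulate before a cluster is formed
clustering.queue.delay=30
# cluster time budget in seconds; keep it under half of the walltime the
# site permits (15 minutes allowed -> 7*60 = 420)
clustering.min.time=420
# maximum number of file transfers in flight at once (a count, not a rate)
throttle.transfers=64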
For now, use the defaults and we can see what happens. -- From wilde at mcs.anl.gov Tue Nov 6 10:19:14 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 10:19:14 -0600 Subject: [Swift-devel] Re: angle-1000 second run In-Reply-To: References: <47300229.1040801@mcs.anl.gov> Message-ID: <47309402.9000606@mcs.anl.gov> >> 3. The cluster sizes were extremely small about 4 - should have been 10-20 by >> my calcs. > > Increase the cluster queue delay parameter from 4 to about 30 (seconds). > This will make Swift wait much longer before putting clusters together, > which may allow more jobs to build up in the clustering queue. Previous run had this set to 10 seconds. The logs confirm that this was the clustering period: the cluster size=4 message came out every 10 seconds. > Make sure that you havethe cluster maximum time and maxwalltimes for jobs > set to sensible values, because large clusters will highlight > misconfigurations there. In particular, note that the maximum cluster time > in the config file needs to be (less than) half of the maxwalltime > permitted for the site you submit to (so if you are allowewd to run 15 > minute jobs, set the cluster maximum time to 7*60, for example). I set cluster max time to 1200 with a maxwalltime of 60 seconds. I will fiddle with this part with smaller runs till it works. Likely I have a config issue somewhere, or theres a bug. > Are you using the PBS provider or GRAM to submit? GRAM, gt2. From wilde at mcs.anl.gov Tue Nov 6 10:28:36 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 10:28:36 -0600 Subject: [Swift-devel] Re: Jobs being aborted by PBS server on tg-grid.uc.teragrid.org In-Reply-To: <200711061619.lA6GJZmt028890@rimantadine.ncsa.uiuc.edu> References: <200711061619.lA6GJZmt028890@rimantadine.ncsa.uiuc.edu> Message-ID: <47309634.6090305@mcs.anl.gov> Excellent, thanks Ti. This explains many of our problems, I think. - Mike On 11/6/07 10:19 AM, help at teragrid.org wrote: > FROM: Leggett, Ti > (Concerning ticket No. 147814) > > I think I fixed this this morning. In all the cases you were given a node in which tg-grid1 could not > communicate with. If you still see this, immediately run: > > checkjob > > if you can and send the output. If you can't, send me the job ID. > > Michael Wilde writes: >> The errors below are from workflows of only 5 jobs. >> One job of the five failed in each of these 3 incidents. >> The failing job was then in each case retried twice more (automatically >> by Swift) >> >> GRAM was not failing to my knowledge during these times. >> >> Do the PBS logs indicate anything? >> >> - Mike >> >> >> On 11/6/07 9:52 AM, help at teragrid.org wrote: >>> FROM: Leggett, Ti >>> (Concerning ticket No. 147814) >>> >>> Are you getting these when you're submitting many (thousands) of jobs and does it coincide with > the >>> gatekeeper becoming unavailable? >>> >>> Michael Wilde writes: >>>> Im starting to see more frequent problems like this. >>>> Happened once last night to 3 consecutive jobs, and tonight happened >>>> twice, to 6 jobs. >>>> >>>> Ti, could you look in the PBS logs, possibly on the related node(s) and >>>> see if its looking like a problem on tg-uc or on our side? >>>> >>>> Thanks, >>>> >>>> Mike >>>> >>>> >>>> 11/3 8:05 PM - 3 failures >>>> Job IDs 1571647, 48, & 49 >>>> 11/4 7:46 PM - 3 failures >>>> Job IDs 1572031, 33, & 34 >>>> 11/4 8:56 - 8:57 PM >>>> 1572040, 42, 43 >>>> >>>> All errors have the format below. 
>>>> >>>> Swift retries failing jobs 3 times, hence the groups of 3 above. >>>> >>>> >>>> -------- Original Message -------- >>>> Subject: PBS JOB 1572043.tg-master.uc.teragrid.org >>>> Date: Sun, 4 Nov 2007 20:57:11 -0600 (CST) >>>> From: adm at tg-master.uc.teragrid.org (root) >>>> To: wilde at tg-grid1.uc.teragrid.org >>>> >>>> PBS Job Id: 1572043.tg-master.uc.teragrid.org >>>> Job Name: STDIN >>>> Aborted by PBS Server >>>> Job cannot be executed >>>> See Administrator for help >>> > > From benc at hawaga.org.uk Tue Nov 6 10:11:03 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 6 Nov 2007 16:11:03 +0000 (GMT) Subject: [Swift-devel] minimum rate limit patch for karajan Message-ID: Below is a patch that puts a lower bound on the site scoring in Karajan. This reduces catastrophic problems caused when a large number of jobs fail at once, pushing the site score so low that it never (during the workflow run) recovers. I think this would be useful for Mike in his angle workflows. Also at: http://www.ci.uchicago.edu/~benc/andrew-ratelimit-minimum Index: cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHost.java =================================================================== --- cog.orig/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHost.java 2007-07-13 11:16:11.000000000 +0100 +++ cog/modules/karajan/src/org/globus/cog/karajan/scheduler/WeightedHost.java 2007-10-29 21:30:43.000000000 +0000 @@ -38,6 +38,8 @@ } protected void setScore(double score) { + final int MINWEIGHT = -10; + if(score References: <47300229.1040801@mcs.anl.gov> <47309402.9000606@mcs.anl.gov> Message-ID: <4730A362.7020406@mcs.anl.gov> It seems that the cluster problem is also due to the slow speed of input data file stage-in. It took 6 minutes to stage in 60 40MB input files to uc-tg (this is to NFS; I will try GPFS as well). So at 10 files per minute, if we check the cluster queue every 30 seconds, that about 5 jobs per cluster on average, which explains what we're seeing. 10 fpm = 400MB/min = 6.5MB/sec. Note that Im submitting from the login node to the same cluster - seems very slow. I will test further and try to calibrate the expected speeds on a big file. - Mike On 11/6/07 10:19 AM, Michael Wilde wrote: > >>> 3. The cluster sizes were extremely small about 4 - should have been >>> 10-20 by >>> my calcs. >> >> Increase the cluster queue delay parameter from 4 to about 30 >> (seconds). This will make Swift wait much longer before putting >> clusters together, which may allow more jobs to build up in the >> clustering queue. > > Previous run had this set to 10 seconds. The logs confirm that this was > the clustering period: the cluster size=4 message came out every 10 > seconds. > >> Make sure that you havethe cluster maximum time and maxwalltimes for >> jobs set to sensible values, because large clusters will highlight >> misconfigurations there. In particular, note that the maximum cluster >> time in the config file needs to be (less than) half of the >> maxwalltime permitted for the site you submit to (so if you are >> allowewd to run 15 minute jobs, set the cluster maximum time to 7*60, >> for example). > > I set cluster max time to 1200 with a maxwalltime of 60 seconds. > > I will fiddle with this part with smaller runs till it works. > > Likely I have a config issue somewhere, or theres a bug. > >> Are you using the PBS provider or GRAM to submit? > > GRAM, gt2. 
> From hategan at mcs.anl.gov Tue Nov 6 11:32:05 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 06 Nov 2007 11:32:05 -0600 Subject: [Swift-devel] Re: angle-1000 second run In-Reply-To: <4730A362.7020406@mcs.anl.gov> References: <47300229.1040801@mcs.anl.gov> <47309402.9000606@mcs.anl.gov> <4730A362.7020406@mcs.anl.gov> Message-ID: <1194370326.16107.4.camel@blabla.mcs.anl.gov> On Tue, 2007-11-06 at 11:24 -0600, Michael Wilde wrote: > It seems that the cluster problem is also due to the slow speed of input > data file stage-in. Sounds likely. > > It took 6 minutes to stage in 60 40MB input files to uc-tg > (this is to NFS; I will try GPFS as well). > > So at 10 files per minute, if we check the cluster queue every 30 > seconds, that about 5 jobs per cluster on average, which explains what > we're seeing. > > 10 fpm = 400MB/min = 6.5MB/sec. Note that Im submitting from the login > node to the same cluster - seems very slow. You should also factor in protocol latencies and various things like directory creation/checks. > > I will test further and try to calibrate the expected speeds on a big file. > > - Mike > > > On 11/6/07 10:19 AM, Michael Wilde wrote: > > > >>> 3. The cluster sizes were extremely small about 4 - should have been > >>> 10-20 by > >>> my calcs. > >> > >> Increase the cluster queue delay parameter from 4 to about 30 > >> (seconds). This will make Swift wait much longer before putting > >> clusters together, which may allow more jobs to build up in the > >> clustering queue. > > > > Previous run had this set to 10 seconds. The logs confirm that this was > > the clustering period: the cluster size=4 message came out every 10 > > seconds. > > > >> Make sure that you havethe cluster maximum time and maxwalltimes for > >> jobs set to sensible values, because large clusters will highlight > >> misconfigurations there. In particular, note that the maximum cluster > >> time in the config file needs to be (less than) half of the > >> maxwalltime permitted for the site you submit to (so if you are > >> allowewd to run 15 minute jobs, set the cluster maximum time to 7*60, > >> for example). > > > > I set cluster max time to 1200 with a maxwalltime of 60 seconds. > > > > I will fiddle with this part with smaller runs till it works. > > > > Likely I have a config issue somewhere, or theres a bug. > > > >> Are you using the PBS provider or GRAM to submit? > > > > GRAM, gt2. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Nov 6 12:20:07 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 6 Nov 2007 18:20:07 +0000 (GMT) Subject: [Swift-devel] Re: angle-1000 second run In-Reply-To: <4730A362.7020406@mcs.anl.gov> References: <47300229.1040801@mcs.anl.gov> <47309402.9000606@mcs.anl.gov> <4730A362.7020406@mcs.anl.gov> Message-ID: hitting the transfer throttle a lot according to this: http://www.ci.uchicago.edu/~benc/log-processing/report-awf6-20071106-1101-yxipkgyg/ On Tue, 6 Nov 2007, Michael Wilde wrote: > It seems that the cluster problem is also due to the slow speed of input data > file stage-in. > > It took 6 minutes to stage in 60 40MB input files to uc-tg > (this is to NFS; I will try GPFS as well). > > So at 10 files per minute, if we check the cluster queue every 30 seconds, > that about 5 jobs per cluster on average, which explains what we're seeing. > > 10 fpm = 400MB/min = 6.5MB/sec. 
Note that Im submitting from the login node > to the same cluster - seems very slow. > > I will test further and try to calibrate the expected speeds on a big file. > > - Mike > > > On 11/6/07 10:19 AM, Michael Wilde wrote: > > > > > > 3. The cluster sizes were extremely small about 4 - should have been > > > > 10-20 by > > > > my calcs. > > > > > > Increase the cluster queue delay parameter from 4 to about 30 (seconds). > > > This will make Swift wait much longer before putting clusters together, > > > which may allow more jobs to build up in the clustering queue. > > > > Previous run had this set to 10 seconds. The logs confirm that this was the > > clustering period: the cluster size=4 message came out every 10 seconds. > > > > > Make sure that you havethe cluster maximum time and maxwalltimes for jobs > > > set to sensible values, because large clusters will highlight > > > misconfigurations there. In particular, note that the maximum cluster time > > > in the config file needs to be (less than) half of the maxwalltime > > > permitted for the site you submit to (so if you are allowewd to run 15 > > > minute jobs, set the cluster maximum time to 7*60, for example). > > > > I set cluster max time to 1200 with a maxwalltime of 60 seconds. > > > > I will fiddle with this part with smaller runs till it works. > > > > Likely I have a config issue somewhere, or theres a bug. > > > > > Are you using the PBS provider or GRAM to submit? > > > > GRAM, gt2. > > > > From hategan at mcs.anl.gov Tue Nov 6 12:37:50 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 06 Nov 2007 12:37:50 -0600 Subject: [Swift-devel] Re: angle-1000 second run In-Reply-To: References: <47300229.1040801@mcs.anl.gov> <47309402.9000606@mcs.anl.gov> <4730A362.7020406@mcs.anl.gov> Message-ID: <1194374270.18379.2.camel@blabla.mcs.anl.gov> So I just spoke to Bill. The errors we see when transfers go up, we should not see them. In the tests they've done a while ago hundreds of parallel transfers on typical machines were not a problem. So we need to isolate the issue. Possible causes: 1. The Java GridFTP client 2. The CI network 3. Problems introduced in the server after the tests above. Mihael On Tue, 2007-11-06 at 18:20 +0000, Ben Clifford wrote: > hitting the transfer throttle a lot according to this: > http://www.ci.uchicago.edu/~benc/log-processing/report-awf6-20071106-1101-yxipkgyg/ > > > On Tue, 6 Nov 2007, Michael Wilde wrote: > > > It seems that the cluster problem is also due to the slow speed of input data > > file stage-in. > > > > It took 6 minutes to stage in 60 40MB input files to uc-tg > > (this is to NFS; I will try GPFS as well). > > > > So at 10 files per minute, if we check the cluster queue every 30 seconds, > > that about 5 jobs per cluster on average, which explains what we're seeing. > > > > 10 fpm = 400MB/min = 6.5MB/sec. Note that Im submitting from the login node > > to the same cluster - seems very slow. > > > > I will test further and try to calibrate the expected speeds on a big file. > > > > - Mike > > > > > > On 11/6/07 10:19 AM, Michael Wilde wrote: > > > > > > > > 3. The cluster sizes were extremely small about 4 - should have been > > > > > 10-20 by > > > > > my calcs. > > > > > > > > Increase the cluster queue delay parameter from 4 to about 30 (seconds). > > > > This will make Swift wait much longer before putting clusters together, > > > > which may allow more jobs to build up in the clustering queue. > > > > > > Previous run had this set to 10 seconds. 
The logs confirm that this was the > > > clustering period: the cluster size=4 message came out every 10 seconds. > > > > > > > Make sure that you havethe cluster maximum time and maxwalltimes for jobs > > > > set to sensible values, because large clusters will highlight > > > > misconfigurations there. In particular, note that the maximum cluster time > > > > in the config file needs to be (less than) half of the maxwalltime > > > > permitted for the site you submit to (so if you are allowewd to run 15 > > > > minute jobs, set the cluster maximum time to 7*60, for example). > > > > > > I set cluster max time to 1200 with a maxwalltime of 60 seconds. > > > > > > I will fiddle with this part with smaller runs till it works. > > > > > > Likely I have a config issue somewhere, or theres a bug. > > > > > > > Are you using the PBS provider or GRAM to submit? > > > > > > GRAM, gt2. > > > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Nov 6 14:32:07 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 6 Nov 2007 20:32:07 +0000 (GMT) Subject: [Swift-devel] unexpected event state sequences Message-ID: I've seen this a few times in different runs - file trasnfer task going through sequence of states Submitted -> Failed -> Active (my log processing assumes that active isn't a final state...) 2007-11-06 13:04:30,052-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=ur n:0-127-1-1194375466307) setting status to Submitted 2007-11-06 13:04:30,056-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=ur n:0-118-1-1194375466305) setting status to Failed Error communicating with the G ridFTP server 2007-11-06 13:04:30,057-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=ur n:0-127-1-1194375466307) setting status to Failed Error communicating with the G ridFTP server 2007-11-06 13:04:30,321-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=ur n:0-127-1-1194375466307) setting status to Active From wilde at mcs.anl.gov Tue Nov 6 23:36:19 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 06 Nov 2007 23:36:19 -0600 Subject: [Swift-devel] started angle-1000 using ci-san data and ext mapper Message-ID: <47314ED3.7010404@mcs.anl.gov> started this running at 23:07, ~wilde/angle/data. processing the files of spool_1 and spool_2 and naming the outputs accordingly. I spotted several failures to create a kickstart dir in the prior run (which i killed) and at least one such error in this run. Ive gotten about 10 failures so far from PBS aborts; looks like a node is bad again (sent mail). Added a zcat to the app to decompress the uic data. Am using Mihael's ext mapper, and its working great so far. This is what my *entire* mapper looks like: -- #! /bin/sh awk References: <47314ED3.7010404@mcs.anl.gov> Message-ID: On Tue, 6 Nov 2007, Michael Wilde wrote: > Ive gotten about 10 failures so far from PBS aborts; looks like a node is bad > again (sent mail). 713 attempts to run jobs worked, 416 failed. (that's at the execute2 level) Looks like a combination of file transfer failures and job execution failures. 
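Since these runs are the first to use the new ext mapper, a rough sketch of a SwiftScript declaration that uses it is shown below (the script name, its argument and the path are placeholders, not the actual angle setup; the assumption is that exec= names the external mapping program and any other parameters are passed through to it):

type file;

// hypothetical example: map an array of input files via an external program
file pcaps[] <ext; exec="list-spool.sh", dir="/disks/ci-gpfs/angle/spool_1">;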
-- From wilde at mcs.anl.gov Wed Nov 7 08:32:24 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 07 Nov 2007 08:32:24 -0600 Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: References: <47314ED3.7010404@mcs.anl.gov> Message-ID: <4731CC78.6080108@mcs.anl.gov> In IM Ben said: -- Ben Clifford its possible to change swift to retry jobs more than 3 times. i did that with andrew with it up at 10 sometimes jobs were running 5 times or so it doesn't fix broken nodes but it increases chances of workflow completion. -- Sounds good, will try. With this kind of cluster problem, there's little else we can do from outside the cluster. On 11/7/07 8:13 AM, Ben Clifford wrote: > > On Tue, 6 Nov 2007, Michael Wilde wrote: > >> Ive gotten about 10 failures so far from PBS aborts; looks like a node is bad >> again (sent mail). > > 713 attempts to run jobs worked, 416 failed. (that's at the execute2 > level) > > Looks like a combination of file transfer failures and job execution > failures. > From benc at hawaga.org.uk Wed Nov 7 08:36:36 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 7 Nov 2007 14:36:36 +0000 (GMT) Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: <4731CC78.6080108@mcs.anl.gov> References: <47314ED3.7010404@mcs.anl.gov> <4731CC78.6080108@mcs.anl.gov> Message-ID: On Wed, 7 Nov 2007, Michael Wilde wrote: > Sounds good, will try. With this kind of cluster problem, there's little else > we can do from outside the cluster. Did you get stuff working on another site yet? Site selection should cause better sites to be preferred by magic. -- From wilde at mcs.anl.gov Wed Nov 7 08:52:06 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 07 Nov 2007 08:52:06 -0600 Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: References: <47314ED3.7010404@mcs.anl.gov> <4731CC78.6080108@mcs.anl.gov> Message-ID: <4731D116.6070704@mcs.anl.gov> No, have not - am working on it. Let me know if you can help. Is that error retry change ( 3 => 10) must be a patch, as I dont see a property for it? Lets also discuss what options we have for working around the data transfer problem. I'd like to propose two test runs: - data is all local, no gridftp, and data xfer is unthrottled - job throttle wide open, but job delivery rate slowed down to a GT2 happy-level. If we have a central job dispatcher that is aware of where data is from a simple map, want to see if we can then achieve fast runs. I'll be as UC in an hour but want to start a test run first. What should we run next? On 11/7/07 8:36 AM, Ben Clifford wrote: > > On Wed, 7 Nov 2007, Michael Wilde wrote: > >> Sounds good, will try. With this kind of cluster problem, there's little else >> we can do from outside the cluster. > > Did you get stuff working on another site yet? Site selection should cause > better sites to be preferred by magic. > From benc at hawaga.org.uk Wed Nov 7 09:00:45 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 7 Nov 2007 15:00:45 +0000 (GMT) Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: <4731D116.6070704@mcs.anl.gov> References: <47314ED3.7010404@mcs.anl.gov> <4731CC78.6080108@mcs.anl.gov> <4731D116.6070704@mcs.anl.gov> Message-ID: On Wed, 7 Nov 2007, Michael Wilde wrote: > Is that error retry change ( 3 => 10) must be a patch, as I dont see a > property for it? yes. 
though I find myself tweaking it enough recently that I'll add it a a property sometime soon, I think. http://www.ci.uchicago.edu/~benc/andrew-many-retries > - job throttle wide open, but job delivery rate slowed down to a GT2 > happy-level. not sure what that means. > I'll be as UC in an hour but want to start a test run first. What should we > run next? In the absence of a reliable site to run on, not sure what there is to do. -- From wilde at mcs.anl.gov Wed Nov 7 09:31:07 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 07 Nov 2007 09:31:07 -0600 Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: References: <47314ED3.7010404@mcs.anl.gov> <4731CC78.6080108@mcs.anl.gov> <4731D116.6070704@mcs.anl.gov> Message-ID: <4731DA3B.8000308@mcs.anl.gov> >> I'll be as UC in an hour but want to start a test run first. What should we >> run next? > > In the absence of a reliable site to run on, not sure what there is to do. Im going to run a test on just the data stagein problem: same data but to 1 job. That should help separate the throttle problems from the basic data problems. From benc at hawaga.org.uk Wed Nov 7 10:05:25 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 7 Nov 2007 16:05:25 +0000 (GMT) Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: <4731DA3B.8000308@mcs.anl.gov> References: <47314ED3.7010404@mcs.anl.gov> <4731CC78.6080108@mcs.anl.gov> <4731D116.6070704@mcs.anl.gov> <4731DA3B.8000308@mcs.anl.gov> Message-ID: now that you're running with larger compute jobs, the amount of time spent in staging as a proportion of the whole runtime is much less. also, are you running with lazy errors on or off (that is set in the swift properties). I think off. for the purposes of letting runs continue longer, that might be a useful setting to turn on. On Wed, 7 Nov 2007, Michael Wilde wrote: > > > > I'll be as UC in an hour but want to start a test run first. What should > > > we > > > run next? > > > > In the absence of a reliable site to run on, not sure what there is to do. > > Im going to run a test on just the data stagein problem: same data but to 1 > job. > > That should help separate the throttle problems from the basic data problems. > > > From benc at hawaga.org.uk Wed Nov 7 10:08:16 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 7 Nov 2007 16:08:16 +0000 (GMT) Subject: [Swift-devel] Re: started angle-1000 using ci-san data and ext mapper In-Reply-To: References: <47314ED3.7010404@mcs.anl.gov> <4731CC78.6080108@mcs.anl.gov> <4731D116.6070704@mcs.anl.gov> <4731DA3B.8000308@mcs.anl.gov> Message-ID: also cog r1833 has a change in logging that makes log processing work better. that would be good to use in future runs. -- From wilde at mcs.anl.gov Wed Nov 7 12:44:39 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 07 Nov 2007 12:44:39 -0600 Subject: [Swift-devel] Data transfer test results Message-ID: <47320797.9070106@mcs.anl.gov> I ran a swift script that invoked one app with the same 1000 45MB input files that angle-1000 reads. It took an hour to stage in these files from local disk to uc-tg. I think thats way too slow and suggests that we have a problem in the basic data transfer mechanism. 
--
Swift script t1.swift starting at Wed Nov 7 10:39:20 CST 2007
running on sites: UC-nfs-gt2-ks
Swift v0.3-dev r1463
RunID: 20071107-1039-unmh7sed
catls started
catls completed
Swift Script t1.swift ended at Wed Nov 7 11:41:17 CST 2007 with exit code 0
--

The logs are in swift-logs/wilde/run172. This merits analysis. I will re-run this with the data coming from CI SAN gridftp via teraport server.

- Mike

From wilde at mcs.anl.gov Wed Nov 7 12:49:32 2007
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Wed, 07 Nov 2007 12:49:32 -0600
Subject: [Swift-devel] Please re-send info on GridFTP servers for CI SAN
Message-ID: <473208BC.3040803@mcs.anl.gov>

Ti, a question for our SC analytics challenge:

Can you re-post the info you sent a while back on the gridftp server on the CI SAN?

We can access the space from the tp-osg gridftp server; when I tried to do so yesterday I ran into errors. I didnt record, because it wasnt clear to me if I was using the right URL to contact it. (I tried stor.ci.uchicago.edu in at least one run). Then I tried stor1. I think stor failed and stor1 hung.

Before I try to capture all this and post here for debugging, can you tell us what the correct URL is to use the server, and any other considerations for striped transfer?

Should stor be a faster server than tp-osg for this data, or same?

can you post a few how-to notes on this to the CI wiki page related to using the SAN?

Thanks,

Mike

From benc at hawaga.org.uk Wed Nov 7 12:58:18 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 7 Nov 2007 18:58:18 +0000 (GMT)
Subject: [Swift-devel] Re: Data transfer test results
In-Reply-To: <47320797.9070106@mcs.anl.gov>
References: <47320797.9070106@mcs.anl.gov>
Message-ID: 

On Wed, 7 Nov 2007, Michael Wilde wrote:

> I ran a swift script that invoked one app with the same 1000 45MB input files
> that angle-1000 reads.

45mb files don't seem particularly representative of the data - I picked spool_190 at random and see that the average file size is 14mb but that many are small, in the 20kb range.

Tuning for 45mb files is likely to not be the same as tuning for 23kb files.

--

From benc at hawaga.org.uk Wed Nov 7 13:17:18 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 7 Nov 2007 19:17:18 +0000 (GMT)
Subject: [Swift-devel] Data transfer test results
In-Reply-To: <47320797.9070106@mcs.anl.gov>
References: <47320797.9070106@mcs.anl.gov>
Message-ID: 

On Wed, 7 Nov 2007, Michael Wilde wrote:

> It took an hour to stage in these files from local disk to uc-tg. I think
> thats way too slow and suggests that we have a problem in the basic data
> transfer mechanism.

Workflow spent whole time pegged at 64 transfers at once maximum.

--

From benc at hawaga.org.uk Wed Nov 7 13:19:41 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Wed, 7 Nov 2007 19:19:41 +0000 (GMT)
Subject: [Swift-devel] Data transfer test results
In-Reply-To: 
References: <47320797.9070106@mcs.anl.gov>
Message-ID: 

On Wed, 7 Nov 2007, Ben Clifford wrote:

> > It took an hour to stage in these files from local disk to uc-tg. I think
> > thats way too slow and suggests that we have a problem in the basic data
> > transfer mechanism.
>
> Workflow spent whole time pegged at 64 transfers at once maximum.

and stats on karajan file transfer tasks are:

Total number of events: 1004
Shortest event (s): 0.223000049591064
Longest event (s): 284.994999885559
Total duration of all events (s): 225762.589998722
Mean event duration (s): 224.863137448926

45mb / 224s = 200kb/s which is pretty ass.
that 200kb isn't caused by rate limiting at the karajan scheduler level, though. -- From wilde at mcs.anl.gov Wed Nov 7 13:25:54 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 07 Nov 2007 13:25:54 -0600 Subject: [Swift-devel] Re: Data transfer test results In-Reply-To: References: <47320797.9070106@mcs.anl.gov> Message-ID: <47321142.7030208@mcs.anl.gov> I think in general they are larger, but I will investigate. In the meantime, run173 just finished, staging in data from the tp-osg server. This went very nice: staged in 1000 45MB files in 10 minutes! I think yesterday I measured the disk-to-disk copy time using dd of a 2.5GB file at 2 minutes, so this WAN transfer from CI to Argonne at 10 minutes is only about 2.5X slower. Thats not bad, and 10 minutes to stage the whole dataset is not bad. Lets discuss net how best to achieve or simulate/hack caching of inputs on the local site. Whats the best way to do that and test it? - Mike (btw - run173 above failed in the end, I think, due to long cmd line length. Need to discuss that as well, as we may need to demo a summarization job such as this test simulates). On 11/7/07 12:58 PM, Ben Clifford wrote: > > On Wed, 7 Nov 2007, Michael Wilde wrote: > >> I ran a swift script that invoked one app with the same 1000 45MB input files >> that angle-1000 reads. > > 45mb files don't seem particularly representative of the data - I picked > spool_190 at random and see that the average file size is 14mb but that > many are small, in the 20kb range. > > Tuning for 45mb files is likely to not be the same as tuning for 23kb > files. > From benc at hawaga.org.uk Wed Nov 7 13:31:43 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 7 Nov 2007 19:31:43 +0000 (GMT) Subject: [Swift-devel] Data transfer test results In-Reply-To: References: <47320797.9070106@mcs.anl.gov> Message-ID: in t1, you have no structure to your submit files and you're staging into up to 8 different FTP servers; I could imagine that there's shared filesystem contention there as eight servers pass the lock for the root of the site-side data cache around. You could try two things: i) structure your data more sensibly (eg. hierarchical directories) ii) use only one gridftp server, eg name one of the servers explicitly like tg-s008.uc.teragrid.org rather than using the tg-gridftp name which goes to 8 different servers. -- From leggett at ci.uchicago.edu Wed Nov 7 15:19:06 2007 From: leggett at ci.uchicago.edu (Ti Leggett) Date: Wed, 7 Nov 2007 15:19:06 -0600 Subject: [Swift-devel] Re: Please re-send info on GridFTP servers for CI SAN In-Reply-To: <473208BC.3040803@mcs.anl.gov> References: <473208BC.3040803@mcs.anl.gov> Message-ID: <9D503713-1C9B-4CAC-980B-DF5DA31DFB2B@ci.uchicago.edu> This should be working, there was an error in the gridftp configuration files. On Nov 7, 2007, at 12:49 PM, Michael Wilde wrote: > Ti, a question for our SC analytics challenge: > > Can you re-post the info you sent a while back on the gridftp server > on the CI SAN? > > We can access the space from the tp-osg gridftp server; when I tried > to do so yesterday I ran into errors. I didnt record, because it > wasnt clear to me if I was using the right URL to contact it. (I > tried stor.ci.uchicago.edu in at least one run). Then I tried > stor1. I think stor failed and stor1 hung. > > Before I try to capture all this and post here for debugging, can > you tell us what the correct URL is to use the server, and any other > considerations for striped transfer? 
>
> Should stor be a faster server than tp-osg for this data, or same?
>
> can you post a few how-to notes on this to the CI wiki page related
> to using the SAN?
>
> Thanks,
>
> Mike
>

From benc at hawaga.org.uk Thu Nov 8 09:41:35 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Thu, 8 Nov 2007 15:41:35 +0000 (GMT)
Subject: [Swift-devel] job success in the presence of massive brokenness
Message-ID: 

I have been interested to see how, over the past few days, runs with fairly large job failure rates (on the order of 30%) have still been able to complete jobs successfully and (with an appropriately hacked scheduler) not get stuck at an appallingly slow completion rate.

Eventually, when some real investigation happens in the scheduler, putting in artificially broken job/file transfer submissions to see how things perform will be an interesting thing to do.

--

From wilde at mcs.anl.gov Fri Nov 9 07:50:41 2007
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 09 Nov 2007 07:50:41 -0600
Subject: [Swift-devel] timing stats from run194
Message-ID: <473465B1.1080901@mcs.anl.gov>

the attached list of runtimes from angle run194 is interesting - there is quite a variance. one can see how "unlucky" clusters of jobs will hit the time limit.

only thing i can think of is to tweak the limit up.

Would be good to plot:
- displays of runtime range etc
- runtime vs input file size (kickstart can tell us this but it takes coordination with the kickstart caller that i dont think swift does yet)

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: kickstart.summary.byruntime
URL: 

From benc at hawaga.org.uk Fri Nov 9 07:58:26 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 9 Nov 2007 13:58:26 +0000 (GMT)
Subject: [Swift-devel] Re: timing stats from run194
In-Reply-To: <473465B1.1080901@mcs.anl.gov>
References: <473465B1.1080901@mcs.anl.gov>
Message-ID: 

On Fri, 9 Nov 2007, Michael Wilde wrote:

> the attached list of runtimes from angle run194 is interesting - there is
> quite a variance. one can see how "unlucky" clusters of jobs will hit the time
> limit.
>
> only thing i can think of is to tweak the limit up.
>
> Would be good to plot:
> - displays of runtime range etc

That gets plotted already - I just sent you that. The 90s wall time limit is capturing maybe 70% of successful jobs.

http://www.ci.uchicago.edu/~benc/log-processing/report-20071108-2248-okx1odlc/kickstart-duration-histogram.png

--

From benc at hawaga.org.uk Fri Nov 9 08:06:46 2007
From: benc at hawaga.org.uk (Ben Clifford)
Date: Fri, 9 Nov 2007 14:06:46 +0000 (GMT)
Subject: [Swift-devel] Re: timing stats from run194
In-Reply-To: 
References: <473465B1.1080901@mcs.anl.gov>
Message-ID: 

I suspect what may be happening in the most recent run is that a bunch of long jobs are accumulating for retry, having failed earlier due to walltimes, and are now spending forever over and over running out of walltime.

--

From wilde at mcs.anl.gov Fri Nov 9 08:19:19 2007
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Fri, 09 Nov 2007 08:19:19 -0600
Subject: [Swift-devel] Re: timing stats from run194
In-Reply-To: 
References: <473465B1.1080901@mcs.anl.gov>
Message-ID: <47346C67.4080501@mcs.anl.gov>

Ah - a perfectly logical explanation, and a hard case to handle with retry. Perhaps the retry mechanism should be taught to recognize over-walltime errors and bump up the walltime for the failures based on per-application settings.
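The per-application knob that exists today is the GLOBUS::maxwalltime profile on each tc.data entry; an illustrative line (site, path and value are examples, not the real catalog; a bare number is taken as minutes, and the 00:05:00 style used later in this thread also works) would be:

UC  angle  /path/to/angle.sh  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime=5;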
On 11/9/07 8:06 AM, Ben Clifford wrote: > I suspect what may be happening in the most recent run is that a bunch of > long jobs are accumulating for retry, having failed earlier due to > walltimes, and are now spending forever over and over running out of > walltime. > From benc at hawaga.org.uk Fri Nov 9 08:24:50 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 9 Nov 2007 14:24:50 +0000 (GMT) Subject: [Swift-devel] Re: timing stats from run194 In-Reply-To: <47346C67.4080501@mcs.anl.gov> References: <473465B1.1080901@mcs.anl.gov> <47346C67.4080501@mcs.anl.gov> Message-ID: On Fri, 9 Nov 2007, Michael Wilde wrote: > Ah - a perfectly logical explanation, and a hard case to handle with retry. > Perhaps the retry mechanism should be taught to recognize over-walltime errors > and bump up the walltime for the failures based on per-application settings. well, that's not really the semantics of maxwalltime - you as the application user assert in your maxwalltime spec that it is an error for your jobs to take longer than that. it is perhaps bad to allow one job breaking that assertion to cause a clusterful of jobs to fail. it may also be more sensible in the case of widely varying loads to specify the clusteriness in terms of jobs-per-cluster rather than the present maxwalltime based approach. exciting application-specific estimation of appropriate maxwalltimes for invocations, rather than for all invocations of an app - based (eg) on input file or other parameters is an option to also investigate in the future. -- From wilde at mcs.anl.gov Fri Nov 9 08:47:26 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 09 Nov 2007 08:47:26 -0600 Subject: [Swift-devel] Emails from LSF from NCSA tungsten: normal or problem? Message-ID: <473472FE.9050709@mcs.anl.gov> I get tens to hundres of these from tungsten. I dont see an error here - are these normal and if so can i prevent the email? If it indicates an error, what is this telling me? (the short runtime at th end is suspcicious but other than that I dont see an error in here) -------- Original Message -------- Subject: Job 1110502: <#! /bin/sh;#;# LSF batch job script built by Globus 4.0.1-r3 Job Manager;#;#BSUB -i /dev/null;#BSUB -o /dev/null;#BSUB -e /dev/null;#BSUB -N;#BSUB -n 1;X509_USER_PROXY=/u/ac/wilde/.globus/job/tund.ncsa.uiuc.edu/21595.1194618548/x509> Done Date: Fri, 9 Nov 2007 08:36:10 -0600 From: LSF To: wilde at ncsa.uiuc.edu Job <#! /bin/sh;#;# LSF batch job script built by Globus 4.0.1-r3 Job Manager;#;#BSUB -i /dev/null;#BSUB -o /dev/null;#BSUB -e /dev/null;#BSUB -N;#BSUB -n 1;X509_USER_PROXY=/u/ac/wilde/.globus/job/tund.ncsa.uiuc.edu/21595.1194618548/x509> was submitted from host by user . Job was executed on host(s) , in queue , as user . was used as the home directory. was used as the working directory. Started at Fri Nov 9 08:35:49 2007 Results reported at Fri Nov 9 08:36:10 2007 Your job looked like: ------------------------------------------------------------ # LSBATCH: User input #! 
/bin/sh # # LSF batch job script built by Globus 4.0.1-r3 Job Manager # #BSUB -i /dev/null #BSUB -o /dev/null #BSUB -e /dev/null #BSUB -N #BSUB -n 1 X509_USER_PROXY=/u/ac/wilde/.globus/job/tund.ncsa.uiuc.edu/21595.1194618548/x509_up; export X509_USER_PROXY GLOBUS_LOCATION=/usr/local/prews-gram-4.0.1-r3/; export GLOBUS_LOCATION GLOBUS_GRAM_JOB_CONTACT=https://tund.ncsa.uiuc.edu:50031/21595/1194618548/; export GLOBUS_GRAM_JOB_CONTACT GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://tund.ncsa.uiuc.edu:50032/; export GLOBUS_GRAM_MYJOB_CONTACT HOME=/u/ac/wilde; export HOME LOGNAME=wilde; export LOGNAME if test 'X${LD_LIBRARY_PATH}' != 'X'; then LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:" else LD_LIBRARY_PATH="" fi export LD_LIBRARY_PATH #Change to directory requested by user cd /scratch/users/wilde/swiftdata/awf8-20071109-0827-it230m61 . /usr/lsf/conf/profile.lsf && /usr/lsf/6.0/linux2.4-glibc2.3-x86/bin/lsgrun -p -m "$LSB_HOSTS" /bin/sh "shared/wrapper.sh" "angle4-ggxk5uji" "-jobdir" "g" "-e" "/u/ac/wilde/angle/bin/angle4.sh" "-out" "stdout.txt" "-err" "stderr.txt" "-i" "-d" "disks/ci-gpfs/angle/spool_1|_output/of/spool_1|_output/cf/spool_1" "-if" "/disks/ci-gpfs/angle/spool_1/ncdm2-1182355200-dump.1.272.pcap.gz" "-of" "_output/of/spool_1/of.ncdm2-1182355200-dump.1.272.angle|_output/cf/spool_1/cf.ncdm2-1182355200-dump.1.272.center" "-k" "/u/ac/wilde/swift/tools/mystart" "-a" "disks/ci-gpfs/angle/spool_1/ncdm2-1182355200-dump.1.272.pcap.gz" "_output/of/spool_1/of.ncdm2-1182355200-dump.1.272.angle" "_output/cf/spool_1/cf.ncdm2-1182355200-dump.1.272.center" 2> /dev/null ------------------------------------------------------------ Successfully completed. Resource usage summary: CPU time : 0.71 sec. Max Memory : 3 MB Max Swap : 6 MB Max Processes : 1 Max Threads : 1 From benc at hawaga.org.uk Fri Nov 9 08:52:39 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 9 Nov 2007 14:52:39 +0000 (GMT) Subject: [Swift-devel] Emails from LSF from NCSA tungsten: normal or problem? In-Reply-To: <473472FE.9050709@mcs.anl.gov> References: <473472FE.9050709@mcs.anl.gov> Message-ID: On Fri, 9 Nov 2007, Michael Wilde wrote: > I get tens to hundres of these from tungsten. > I dont see an error here - are these normal and if so can i prevent the email? > > If it indicates an error, what is this telling me? > > (the short runtime at th end is suspcicious but other than that I dont see an > error in here) It ends with: Successfully completed. so my first guess is that its a success message. Not sure how to make those go away. -- From wilde at mcs.anl.gov Fri Nov 9 09:14:29 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 09 Nov 2007 09:14:29 -0600 Subject: [Swift-devel] Emails from LSF from NCSA tungsten: normal or problem? In-Reply-To: References: <473472FE.9050709@mcs.anl.gov> Message-ID: <47347955.3010306@mcs.anl.gov> what puzzles me is the very short runtime - unless thats the cpu time of the wrapper script, not including the cpu time of its kids. which would be odd since it seems to know the number of processes that were running (if thats what the "Max Processes :6" means.) So I will ignore for now. On 11/9/07 8:52 AM, Ben Clifford wrote: > > On Fri, 9 Nov 2007, Michael Wilde wrote: > >> I get tens to hundres of these from tungsten. >> I dont see an error here - are these normal and if so can i prevent the email? >> >> If it indicates an error, what is this telling me? >> >> (the short runtime at th end is suspcicious but other than that I dont see an >> error in here) > > It ends with: Successfully completed. 
> so my first guess is that its a success message. Not sure how to make > those go away. > From wilde at mcs.anl.gov Tue Nov 13 04:30:48 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 13 Nov 2007 04:30:48 -0600 Subject: [Swift-devel] clustering problem: Message-ID: <47397CD8.505@mcs.anl.gov> I suspect a problem in clustering. I had the following entries in tc.data: UC angle /home/wilde/angle32/bin/angle.multiarch.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20; sdsc angle /users/ux454325/angle/bin/angle.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20; tungsten angle /u/ac/wilde/angle/bin/angle.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20; teraport angle /home/wilde/angle/bin/angle.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20; mercury angle /home/ncsa/wilde/angle/bin/angle.sh INSTALLED INTEL32::LINUX GLOBUS::maxwalltime=20; and the following swift.properties: kickstart.always.transfer=true clustering.enabled=true clustering.queue.delay=15 clustering.min.time=12000 throttle.transfers=64 sitedir.keep=true lazy.errors=true -- which when I ran a batch of 100 jobs, caused job manager failures and no jobs started. the server side jobs, inf and status dirs were empty. No jobs would show up in the PBS queue. I found the following in the serve-side gram logs: gram_job_mgr_1000.log:11/13 03:36:04 JM: GT3 extended error message: GRAM_SCRIPT_GT3_FAILURE_MESSAGE:This job will be charged to account: brn (TG-CCR080001) qsub: Illegal attribute or resource value for Resource_List.walltime gram_job_mgr_1000.log:11/13 03:36:04 JMI: while return_buf = GRAM_SCRIPT_ERROR = 17 -- when I changed maxwalltime to "00:05:00" and the properties to: clustering.queue.delay=30 clustering.min.time=1200 throttle.transfers=16 things work, and all 100 jobs finish smoothly. I suspect that something in my previous parameters is causing an invalid walltime to be sent to pbs. Still digging into this but need help. From wilde at mcs.anl.gov Fri Nov 16 01:19:11 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 16 Nov 2007 01:19:11 -0600 Subject: [Swift-devel] pbs jobs lingering in completing state on uc-teragrid Message-ID: <473D446F.9060501@mcs.anl.gov> Hi Help Team, A question for the Argonne TG group: Starting Tue i saw for the first time that my jobs on uc-teragrid seemed to be lingering for a while in the "C" - completing - state. Is this normal, or a new behavior, or just something I didnt notice before. I dont have an exact time on how long they linger, butit seemed unusual. Any thoughts on this, Ti? - Mike From wilde at mcs.anl.gov Fri Nov 16 01:46:43 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 16 Nov 2007 01:46:43 -0600 Subject: [Swift-devel] swift status monitors Message-ID: <473D4AE3.8030005@mcs.anl.gov> Mihael, do you have something that we can use to display run status? I was fiddling with a small curses-based tool to do this ( tail -f run*.log | shredlog | curse ) but Ben reminded me that you have something brewing. From hategan at mcs.anl.gov Fri Nov 16 21:02:02 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 16 Nov 2007 21:02:02 -0600 Subject: [Swift-devel] Re: swift status monitors In-Reply-To: <473D4AE3.8030005@mcs.anl.gov> References: <473D4AE3.8030005@mcs.anl.gov> Message-ID: <1195268522.3714.7.camel@blabla.mcs.anl.gov> On Fri, 2007-11-16 at 01:46 -0600, Michael Wilde wrote: > Mihael, do you have something that we can use to display run status? 
> > I was fiddling with a small curses-based tool to do this > > ( tail -f run*.log | shredlog | curse ) > > but Ben reminded me that you have something brewing. "brewing" is the right word. > From benc at hawaga.org.uk Tue Nov 20 00:04:59 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 20 Nov 2007 06:04:59 +0000 (GMT) Subject: [Swift-devel] playing with array closing. Message-ID: I spent some time at the weekend and today playing with 'the array closing problem'. The array-closing problem is what happens when we combine single-assignment semantics (which say that you will only write a=foo; once for each variable a) with our array assignment semantics (which say that arrays are populated by multiple assignments, a[0]=foo; a[1]=bar;). Below, exhibit A, is a program which does not work in the present trunk implementation - instead it hangs after executing top-level statements R,S,T and before executing statement W. Statement W will not be executed until the array name 'array' is closed, that is, until it is known that there are no further writes to the array. So I prototyped some compile-time dataflow analysis (a bit like the present input marking code that already exists) to see that statements R,S,T write (or potentially write to) 'array' and that no other statements do. Armed with this knowledge, the compiled karajan code is modified so that: i) when datasets are created (using vdl:new) they are labelled with a list of statements that may write to them. ii) those statements are modified so that they notify the appropriate datasets when they have finished. So each statement issues a partial close on the datasets it writes to, and each dataset is aware which partial closes to expect. When a dataset has received partial closes (at runtime) from everything it is expecting (which is determined at compile time), it becomes fully closed. In the example code, that means that statement W's dependency on the array being closed is now satisfied, and so it is executed, and so this workflow ends. Its not so straightforward - for example, statement U writes to the array several times, and we don't want the first write to do the corresponding partial close. So the above processing happens only for statements in the same scope as the declaration. In the case of sub-scopes, such as inside a foreach, partial closes don't happen, but the enclosing statement (foreach in the example below) are treated as a single statement which completes and closes only when the whole loop is finished. I think this is the right approach to pursue for this problem. Also, I think that this implementation could join up with the present dataset marking code (which is used to determine what is an input and what is not), and also be used for better compile time type checking and related things (eg. checking for variables declared multiple times, variables assigned to multiple times when they shouldn't be, ...) ==== EXHIBIT A, being a program which does not work in the present trunk implementation ==== type file; (file f) writefile(int s) { app { echo s stdout=@f; } } (file f) listvals(file array[]) { app { echo @filenames(array) stdout=@f; } } file array[]; (Q) array[0]=writefile(99999); (R) array[1]=writefile(10000); (S) foreach i in [2:5] { (T) array[i]=writefile(i+80); (U) } file out <"out">; (V) out = listvals(array); (W) From hategan at mcs.anl.gov Tue Nov 20 00:42:35 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 20 Nov 2007 00:42:35 -0600 Subject: [Swift-devel] playing with array closing. 
In-Reply-To: References: Message-ID: <1195540956.971.5.camel@blabla.mcs.anl.gov> I'm thinking... It may be ok to deal with array closing lexically instead of in a dataflow way. In other words close the array after the last lexical write (the scoping problem you mention still remains, but seems ok). This may simplify the implementation and have less memory overhead. The downside is that some corner cases may still break (e.g. calling listvals(array) from inside the foreach - though maybe that breaks anyway). On Tue, 2007-11-20 at 06:04 +0000, Ben Clifford wrote: > I spent some time at the weekend and today playing with 'the array closing > problem'. > > The array-closing problem is what happens when we combine > single-assignment semantics (which say that you will only write a=foo; > once for each variable a) with our array assignment semantics (which say > that arrays are populated by multiple assignments, a[0]=foo; a[1]=bar;). > > Below, exhibit A, is a program which does not work in the present > trunk implementation - instead it hangs after executing > top-level statements R,S,T and before executing statement W. > > Statement W will not be executed until the array name 'array' is closed, > that is, until it is known that there are no further writes to the array. > > So I prototyped some compile-time dataflow analysis (a bit like the > present input marking code that already exists) to see that statements > R,S,T write (or potentially write to) 'array' and that no other statements > do. > > Armed with this knowledge, the compiled karajan code is modified so that: > i) when datasets are created (using vdl:new) they are labelled with a > list of statements that may write to them. > ii) those statements are modified so that they notify the appropriate > datasets when they have finished. > > So each statement issues a partial close on the datasets it writes to, and > each dataset is aware which partial closes to expect. > > When a dataset has received partial closes (at runtime) from everything it > is expecting (which is determined at compile time), it becomes fully > closed. > > In the example code, that means that statement W's dependency on the array > being closed is now satisfied, and so it is executed, and so this workflow > ends. > > Its not so straightforward - for example, statement U writes to the array > several times, and we don't want the first write to do the corresponding > partial close. So the above processing happens only for statements in the > same scope as the declaration. In the case of sub-scopes, such as inside a > foreach, partial closes don't happen, but the enclosing statement (foreach > in the example below) are treated as a single statement which completes > and closes only when the whole loop is finished. > > I think this is the right approach to pursue for this problem. > > Also, I think that this implementation could join up with the present > dataset marking code (which is used to determine what is an input and what > is not), and also be used for better compile time type checking and > related things (eg. checking for variables declared multiple times, > variables assigned to multiple times when they shouldn't be, ...) 
> > ==== EXHIBIT A, being a program which does not work in the present trunk > implementation ==== > type file; > > (file f) writefile(int s) { > app { > echo s stdout=@f; > } > } > > > (file f) listvals(file array[]) { > app { > echo @filenames(array) stdout=@f; > } > } > > file array[]; (Q) > > array[0]=writefile(99999); (R) > array[1]=writefile(10000); (S) > > foreach i in [2:5] { (T) > array[i]=writefile(i+80); (U) > } > > file out <"out">; (V) > > out = listvals(array); (W) > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Nov 20 00:54:37 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 20 Nov 2007 06:54:37 +0000 (GMT) Subject: [Swift-devel] playing with array closing. In-Reply-To: <1195540956.971.5.camel@blabla.mcs.anl.gov> References: <1195540956.971.5.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 20 Nov 2007, Mihael Hategan wrote: > It may be ok to deal with array closing lexically instead of in a > dataflow way. In other words close the array after the last lexical > write (the scoping problem you mention still remains, but seems ok). > This may simplify the implementation and have less memory overhead. That's pretty much what this is. Lexical treatment at compile time. But I think there needs to be some runtime join of the various statements because they don't get executed (or rather, don't complete) in lexical order. > The downside is that some corner cases may still break (e.g. calling > listvals(array) from inside the foreach - though maybe that breaks > anyway). That works in what I did if the loop body doesn't also assign to the array. However, it has deadlock problems if the loop body both assigns to the array and reads from it. With more complication at runtime, I think thats rectifiable - rather than partially-closing after the loop entirely finishes, should be possible to track which pieces of the inner loop have run and close after the appropriate statements have been run for each iteration. -- From hategan at mcs.anl.gov Tue Nov 20 09:44:27 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 20 Nov 2007 09:44:27 -0600 Subject: [Swift-devel] playing with array closing. In-Reply-To: References: <1195540956.971.5.camel@blabla.mcs.anl.gov> Message-ID: <1195573468.2022.6.camel@blabla.mcs.anl.gov> On Tue, 2007-11-20 at 06:54 +0000, Ben Clifford wrote: > > On Tue, 20 Nov 2007, Mihael Hategan wrote: > > > It may be ok to deal with array closing lexically instead of in a > > dataflow way. In other words close the array after the last lexical > > write (the scoping problem you mention still remains, but seems ok). > > This may simplify the implementation and have less memory overhead. > > That's pretty much what this is. Lexical treatment at compile time. But I > think there needs to be some runtime join of the various statements > because they don't get executed (or rather, don't complete) in lexical > order. I thought you can reorder them, but it may be difficult if a single statement writes to multiple arrays (such as the foreach). > > > The downside is that some corner cases may still break (e.g. calling > > listvals(array) from inside the foreach - though maybe that breaks > > anyway). > > That works in what I did if the loop body doesn't also assign to the > array. However, it has deadlock problems if the loop body both assigns to > the array and reads from it. 
> > With more complication at runtime, I think thats rectifiable - rather than > partially-closing after the loop entirely finishes, should be possible to > track which pieces of the inner loop have run and close after the > appropriate statements have been run for each iteration. > From benc at hawaga.org.uk Tue Nov 20 10:41:54 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 20 Nov 2007 16:41:54 +0000 (GMT) Subject: [Swift-devel] playing with array closing. In-Reply-To: <1195573468.2022.6.camel@blabla.mcs.anl.gov> References: <1195540956.971.5.camel@blabla.mcs.anl.gov> <1195573468.2022.6.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 20 Nov 2007, Mihael Hategan wrote: > I thought you can reorder them, but it may be difficult if a single > statement writes to multiple arrays (such as the foreach). Source text order is a linear order; it is possible to flatten any DAG into a linear order, but loses some of the information in the DAG. I thought a bit before about trying to reorder dataset delcarations in the compiled code with respect to execution statements, to try to get mapper parameters computed before they are used; I think what happened was that this introduced unnecessary serialisation (though maybe its possible with a suitably large mix of parallel and sequential blocks). -- From bugzilla-daemon at mcs.anl.gov Tue Nov 20 11:04:23 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 11:04:23 -0600 (CST) Subject: [Swift-devel] [Bug 112] New: error reporting in procedure declarations Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=112 Summary: error reporting in procedure declarations Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: benc at hawaga.org.uk CC: swift-devel at ci.uchicago.edu The below code fragment has poor error reporting in r1471 - the procedure declaration is invalid because 'stdout' is a keyword and cannot be used as a variable name. The parser predictor for procedure declarations predicts based on an entire valid procedure declaration being present, so gives a very poor error message. Predictor can be shortened - perhaps to left-bracket token token (messagefile stdout, messagefile b) greeting(string m) { app { echo m stdout=@filename(a) stderr=@filename(b); } } -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From hategan at mcs.anl.gov Tue Nov 20 11:08:03 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 20 Nov 2007 11:08:03 -0600 Subject: [Swift-devel] playing with array closing. In-Reply-To: References: <1195540956.971.5.camel@blabla.mcs.anl.gov> <1195573468.2022.6.camel@blabla.mcs.anl.gov> Message-ID: <1195578483.5392.3.camel@blabla.mcs.anl.gov> On Tue, 2007-11-20 at 16:41 +0000, Ben Clifford wrote: > On Tue, 20 Nov 2007, Mihael Hategan wrote: > > > I thought you can reorder them, but it may be difficult if a single > > statement writes to multiple arrays (such as the foreach). > > Source text order is a linear order; it is possible to flatten any DAG > into a linear order, but loses some of the information in the DAG. 
> > I thought a bit before about trying to reorder dataset delcarations in the > compiled code with respect to execution statements, to try to get mapper > parameters computed before they are used; I think what happened was that > this introduced unnecessary serialisation (though maybe its possible with > a suitably large mix of parallel and sequential blocks). It isn't in the general case. The "most relevant link" is a book, but see http://dx.doi.org/10.1016/0304-3975(94)00272-X for some info. Basically parallel (independent) and sequential (linear) blocks are not sufficient to provide a decomposition of an arbitrary dag. The article talks about graphs with no primitive structures (those that can be decomposed using only seq/par). > From bugzilla-daemon at mcs.anl.gov Tue Nov 20 11:19:03 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 11:19:03 -0600 (CST) Subject: [Swift-devel] [Bug 113] New: restarts broken in r1471 Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=113 Summary: restarts broken in r1471 Product: Swift Version: unspecified Platform: Macintosh OS/Version: Mac OS Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: hategan at mcs.anl.gov ReportedBy: benc at hawaga.org.uk Restarts don't work (at all?) in r1471. Any example restart log might contain: file://localhost/_concurrent/array-8b41edd4-8cd0-4c09-9e15-da87472c860e--array//elt-4/localhost/arrayclosehang-20071120-0915-tb5pd4ga/shared/elt-4 but on restart, execution appears to look for: ... elt-4/localhost/arrayclosehang-20071120-0917-uoh2opve/shared/elt-4 with a different run-id in the working directory name. so all work is done again. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Tue Nov 20 12:42:34 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 12:42:34 -0600 (CST) Subject: [Swift-devel] [Bug 113] restarts broken in r1471 In-Reply-To: Message-ID: <20071120184234.0A38A164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=113 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |DUPLICATE ------- Comment #1 from benc at hawaga.org.uk 2007-11-20 12:42 ------- oops *** This bug has been marked as a duplicate of 107 *** -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. From bugzilla-daemon at mcs.anl.gov Tue Nov 20 12:42:35 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 12:42:35 -0600 (CST) Subject: [Swift-devel] [Bug 107] restarts broken (by generalisation of data file handling) In-Reply-To: Message-ID: <20071120184235.3AD6D16505@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=107 ------- Comment #1 from benc at hawaga.org.uk 2007-11-20 12:42 ------- *** Bug 113 has been marked as a duplicate of this bug. *** -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. 
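Returning to the series-parallel point made earlier in this thread: the standard counterexample is the 'N' shaped DAG, in which c depends only on a while d depends on both a and b. Any arrangement of the four nodes using only sequential and parallel blocks either adds a dependency that is not in the DAG or drops one that is, which is why flattening an arbitrary DAG into seq/par structure loses information. A small illustration in Java; the node names are made up for the example:

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // The "N" DAG: c depends on a; d depends on a and b.
    class NShapedDag {
        static final Map<String, List<String>> DEPENDS_ON =
                new LinkedHashMap<String, List<String>>();
        static {
            DEPENDS_ON.put("a", Collections.<String>emptyList());
            DEPENDS_ON.put("b", Collections.<String>emptyList());
            DEPENDS_ON.put("c", Arrays.asList("a"));
            DEPENDS_ON.put("d", Arrays.asList("a", "b"));
        }
    }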
From bugzilla-daemon at mcs.anl.gov Tue Nov 20 13:07:06 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 13:07:06 -0600 (CST) Subject: [Swift-devel] [Bug 11] nested {} blocks do not cause nested variable scopes In-Reply-To: Message-ID: <20071120190706.362A6164EC@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=11 benc at hawaga.org.uk changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |swift-devel at ci.uchicago.edu Status|NEW |RESOLVED Resolution| |FIXED ------- Comment #2 from benc at hawaga.org.uk 2007-11-20 13:07 ------- In r1486, I remove nested compound blocks entirely from the language - they appear to have never been used outside of unit tests. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From bugzilla-daemon at mcs.anl.gov Tue Nov 20 13:30:17 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 13:30:17 -0600 (CST) Subject: [Swift-devel] [Bug 39] a poor syntax error In-Reply-To: Message-ID: <20071120193017.CA851164BB@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=39 ------- Comment #1 from benc at hawaga.org.uk 2007-11-20 13:30 ------- the parser is interpreting the > as the greater-than operator in an expression: "econ_prob_list.txt" > results which is syntactically valid, rather than as the termination of the mapper declaration. This makes it get a few tokens further along in parsing than desired in this error reporting case. Use of > for both termination of mapper declaration and as a valid in-declaration token is the root cause here, I think. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You reported the bug, or are watching the reporter. You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at mcs.anl.gov Tue Nov 20 14:48:01 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Tue, 20 Nov 2007 14:48:01 -0600 (CST) Subject: [Swift-devel] [Bug 11] nested {} blocks do not cause nested variable scopes In-Reply-To: Message-ID: <20071120204801.F100C164EC@foxtrot.mcs.anl.gov> http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=11 ------- Comment #3 from hategan at mcs.anl.gov 2007-11-20 14:48 ------- The fact that they were never used outside unit tests, doesn't mean that there is not value to them. On the other hand they may not be worth spending time on. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. From benc at hawaga.org.uk Thu Nov 22 13:10:58 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 22 Nov 2007 19:10:58 +0000 (GMT) Subject: [Swift-devel] multiple declarations of variables. Message-ID: At present, the language allows multiple declarations using the same variable name resulting in the second variable shadowing the first. eg: > file i; > file i; (which is fairly obvious) but also: > file foo <"myfile">; > file foo = f(x); I've seen multiple declarations like this confuse a few people in that past. 
I'd like to make variable shadowing illegal - either of the above should result in a compile time error (or at least a warning). -- From hategan at mcs.anl.gov Thu Nov 22 13:54:31 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 22 Nov 2007 13:54:31 -0600 Subject: [Swift-devel] multiple declarations of variables. In-Reply-To: References: Message-ID: <1195761272.30276.0.camel@blabla.mcs.anl.gov> On Thu, 2007-11-22 at 19:10 +0000, Ben Clifford wrote: > At present, the language allows multiple declarations using the same > variable name resulting in the second variable shadowing the first. > > eg: > > > file i; > > file i; > > (which is fairly obvious) > > but also: > > > file foo <"myfile">; > > file foo = f(x); > > I've seen multiple declarations like this confuse a few people in that > past. > > I'd like to make variable shadowing illegal - either of the above should > result in a compile time error (or at least a warning). Error. Even Java and C do that. > From wilde at mcs.anl.gov Fri Nov 23 14:44:50 2007 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 23 Nov 2007 14:44:50 -0600 Subject: [Swift-devel] multiple declarations of variables. In-Reply-To: References: Message-ID: <47473BC2.8030204@mcs.anl.gov> that sounds good. On 11/22/07 1:10 PM, Ben Clifford wrote: > At present, the language allows multiple declarations using the same > variable name resulting in the second variable shadowing the first. > > eg: > >> file i; >> file i; > > (which is fairly obvious) > > but also: > >> file foo <"myfile">; >> file foo = f(x); > > I've seen multiple declarations like this confuse a few people in that > past. > > I'd like to make variable shadowing illegal - either of the above should > result in a compile time error (or at least a warning). > From hategan at mcs.anl.gov Fri Nov 23 15:42:04 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 23 Nov 2007 15:42:04 -0600 Subject: [Swift-devel] SSH support Message-ID: <1195854124.12780.7.camel@blabla.mcs.anl.gov> I've updated the SSH provider in cog to do a few things: - make better use of connections (cache them). SSH has this nifty thing: On one connection you can configure multiple independent channels (OpenSSH servers seem to support up to 10 such channels per connection). With this you get up to 10 independent shells without authenticating again. - access remote filesystems (a file op provider) with SFTP - get default authentication information from a file (~/.ssh/auth.defaults). I attached a sample. I need to document this. I also added a filesystem element in the site catalog, which works in a similar way to the execution element: /homes/hategan/tmp That basically allows Swift to work with SSH. -------------- next part -------------- localhost.type=key localhost.username=mike localhost.key=/home/mike/.ssh/identity localhost.passphrase= plussed.mcs.anl.gov.type=key plussed.mcs.anl.gov.username=hategan plussed.mcs.anl.gov.key=/home/mike/.ssh/identity plussed.mcs.anl.gov.passphrase= From benc at hawaga.org.uk Fri Nov 23 19:07:53 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 24 Nov 2007 01:07:53 +0000 (GMT) Subject: [Swift-devel] SSH support In-Reply-To: <1195854124.12780.7.camel@blabla.mcs.anl.gov> References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> Message-ID: can it use ssh-agent authentication? when I looked at the ssh code a while ago it didn't seem to want to. 
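The auth.defaults sample attached above is plain key=value text, with one group of entries per host name prefix. A minimal loader for that format could be as simple as the following; this is an illustrative sketch using java.util.Properties, not the actual cog provider code:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    // Reads a file of host-prefixed entries such as
    //   plussed.mcs.anl.gov.username=hategan
    // and answers per-host lookups.
    class AuthDefaults {
        private final Properties props = new Properties();

        AuthDefaults(String path) throws IOException {
            FileInputStream in = new FileInputStream(path);
            try {
                props.load(in);
            } finally {
                in.close();
            }
        }

        // e.g. get("localhost", "key") -> "/home/mike/.ssh/identity"
        String get(String host, String attribute) {
            return props.getProperty(host + "." + attribute);
        }
    }

With the sample file, new AuthDefaults(System.getProperty("user.home") + "/.ssh/auth.defaults").get("plussed.mcs.anl.gov", "username") would return "hategan".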
-- From hategan at mcs.anl.gov Fri Nov 23 19:14:40 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 23 Nov 2007 19:14:40 -0600 Subject: [Swift-devel] SSH support In-Reply-To: References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> Message-ID: <1195866880.20322.1.camel@blabla.mcs.anl.gov> On Sat, 2007-11-24 at 01:07 +0000, Ben Clifford wrote: > can it use ssh-agent authentication? There have been long discussions about that. The ssh agent seems to use some UNIX specific mechanisms to interact with ssh, so it's a bit weird from Java. But I never really looked into the issue in sufficient detail. I think I should. > when I looked at the ssh code a while > ago it didn't seem to want to. From benc at hawaga.org.uk Fri Nov 23 19:40:37 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 24 Nov 2007 01:40:37 +0000 (GMT) Subject: [Swift-devel] SSH support In-Reply-To: <1195866880.20322.1.camel@blabla.mcs.anl.gov> References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> <1195866880.20322.1.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 23 Nov 2007, Mihael Hategan wrote: > There have been long discussions about that. The ssh agent seems to use > some UNIX specific mechanisms to interact with ssh, so it's a bit weird > from Java. But I never really looked into the issue in sufficient > detail. I think I should. right, it uses unix domain sockets. I have no idea what that looks like in Java - I think nothing standard at all. I think maybe ssh-agent is also version-specific (i.e. it operates only with the ssh client from the same release as the ssh-agent) so maybe it's a rather forlorn hope. -- From hategan at mcs.anl.gov Fri Nov 23 19:56:52 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 23 Nov 2007 19:56:52 -0600 Subject: [Swift-devel] SSH support In-Reply-To: References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> <1195866880.20322.1.camel@blabla.mcs.anl.gov> Message-ID: <1195869412.22390.3.camel@blabla.mcs.anl.gov> On Sat, 2007-11-24 at 01:40 +0000, Ben Clifford wrote: > On Fri, 23 Nov 2007, Mihael Hategan wrote: > > > There have been long discussions about that. The ssh agent seems to use > > some UNIX specific mechanisms to interact with ssh, so it's a bit weird > > from Java. But I never really looked into the issue in sufficient > > detail. I think I should. > > right, it uses unix domain sockets. I have no idea what that looks like in > Java - I think nothing standard at all. I think maybe ssh-agent is also > version-specific (i.e. it operates only with the ssh client from the same > release as the ssh-agent) so maybe it's a rather forlorn hope. There is a Java implementation, as far as I remember, of it (even in j2ssh). Though I've never tried it. However, there is also GSISSH. Also not sure what it would take to get that to work in the current scheme. On the other hand, user generated key pairs can be very convenient. It would certainly solve the problem of having to generate proxies on a regular basis in a portal, for which it gets an A in usability/convenience.
> From benc at hawaga.org.uk Fri Nov 23 20:01:07 2007 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 24 Nov 2007 02:01:07 +0000 (GMT) Subject: [Swift-devel] SSH support In-Reply-To: <1195869412.22390.3.camel@blabla.mcs.anl.gov> References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> <1195866880.20322.1.camel@blabla.mcs.anl.gov> <1195869412.22390.3.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 23 Nov 2007, Mihael Hategan wrote: > On the other hand, user generated key pairs can be very convenient. It > would certainly solve the problem of having to generate proxies on a > regular basis in a portal, for which it gets an A in > usability/convenience. though if you're prepared to accept long-term unencrypted credentials, making a proxy valid for the full length of its parent credential is also a reasonable way to proceed. -- From hategan at mcs.anl.gov Fri Nov 23 20:15:09 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 23 Nov 2007 20:15:09 -0600 Subject: [Swift-devel] SSH support In-Reply-To: References: <1195854124.12780.7.camel@blabla.mcs.anl.gov> <1195866880.20322.1.camel@blabla.mcs.anl.gov> <1195869412.22390.3.camel@blabla.mcs.anl.gov> Message-ID: <1195870509.22925.10.camel@blabla.mcs.anl.gov> On Sat, 2007-11-24 at 02:01 +0000, Ben Clifford wrote: > > On Fri, 23 Nov 2007, Mihael Hategan wrote: > > > On the other hand, user generated key pairs can be very convenient. It > > would certainly solve the problem of having to generate proxies on a > > regular basis in a portal, for which it gets an A in > > usability/convenience. > > though if you're prepared to accept long-term unencrypted credentials, > making a proxy valid for the full length of its parent credential is also a > reasonable way to proceed. In a sense. One difference is that you can easily create a key pair to be used for a specific application and specific sites, entirely separate from an identity used to gain access to more critical things. It's harder to get "application certs" from CAs that are accepted by services on the typical servers we use. > From hategan at mcs.anl.gov Wed Nov 28 18:20:18 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Nov 2007 18:20:18 -0600 Subject: [Swift-devel] transfers of small files Message-ID: <1196295618.29963.10.camel@blabla.mcs.anl.gov> So I've been playing with that issue. I've made some measurements outside Swift. Here's a summary: 32k files. From terminable to tg-uc 1 - karajan with connection caching. transfers in parallel. tops at 200KB/s 2 - n*globus-url-copy - With 32 parallel transfers it starts failing and gets about 10KB/s 3 - globus-url-copy with a list of files: around 300KB/s 4 - globus-url-copy with a list of files, E mode, and data channel re-use: 500KB/s So I figured I should hack the GridFTP provider to re-use data channels by default. This is where it gets strange. I get averages (over multiple runs) of over 1MB/s, with mins of about 130KB and max of 1.9MB/s, but with a lot of variability. I'll debug this. However, I think there is still value in enabling this by default. From hategan at mcs.anl.gov Wed Nov 28 18:23:21 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Nov 2007 18:23:21 -0600 Subject: Re: [Swift-devel] transfers of small files In-Reply-To: <1196295618.29963.10.camel@blabla.mcs.anl.gov> References: <1196295618.29963.10.camel@blabla.mcs.anl.gov> Message-ID: <1196295801.29963.11.camel@blabla.mcs.anl.gov> By contrast, multiple large files are transferred at a max of 11MB/s in (1).
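Most of the gap between cases (2) and (4) above is per-file setup cost: authenticating a new control connection and negotiating a new data channel for every small file, versus paying for both once and reusing them. The reuse pattern looks roughly like the sketch below; the client interface here is hypothetical and does not correspond to the real JGlobus/org.globus.ftp API:

    import java.util.List;

    // Hypothetical transfer client; the method names are invented for
    // illustration only.
    interface FtpLikeClient {
        void connectAndAuthenticate(String host) throws Exception;
        void setModeE() throws Exception;        // extended block mode
        void transfer(String src, String dest) throws Exception;
        void close() throws Exception;
    }

    class SmallFileCopier {
        // One connection and one data-channel setup, many transfers.
        static void copyAll(FtpLikeClient client, String host,
                            List<String[]> srcDestPairs) throws Exception {
            client.connectAndAuthenticate(host);  // pay this once
            client.setModeE();                    // allow channel reuse
            try {
                for (String[] pair : srcDestPairs) {
                    client.transfer(pair[0], pair[1]);
                }
            } finally {
                client.close();
            }
        }
    }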
On Wed, 2007-11-28 at 18:20 -0600, Mihael Hategan wrote: > So I've been playing with that issue. I've made some measurements > outside Swift. Here's a summary: > > 32k files. From terminable to tg-uc > > 1 - karajan with connection caching. transfers in parallel. tops at > 200KB/s > > 2 - n*globus-url-copy - With 32 parallel transfers it starts failing and > gets about 10KB/s > > 3 - globus-url-copy with a list of files: around 300KB/s > > 4 - globus-url-copy with a list of files, E mode, and data channel > re-use: 500KB/s > > So I figured I should hack the GridFTP provider to re-use data channels > by default. This is where it gets strange. I get averages (over multiple > runs) of over 1MB/s, with mins of about 130KB and max of 1.9MB/s, but > with a lot of variability. I'll debug this. However, I think there is > still value in enabling this by default. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From foster at mcs.anl.gov Wed Nov 28 18:24:54 2007 From: foster at mcs.anl.gov (Ian Foster) Date: Wed, 28 Nov 2007 18:24:54 -0600 Subject: [Swift-devel] transfers of small files In-Reply-To: <1196295618.29963.10.camel@blabla.mcs.anl.gov> References: <1196295618.29963.10.camel@blabla.mcs.anl.gov> Message-ID: <474E06D6.9030601@mcs.anl.gov> Mihael: It isn't clear to me--are you using the "lots of small files" optimization here? I've CCed John Bresnahan so he can comment. Ian. Mihael Hategan wrote: > So I've been playing with that issue. I've made some measurements > outside Swift. Here's a summary: > > 32k files. From terminable to tg-uc > > 1 - karajan with connection caching. transfers in parallel. tops at > 200KB/s > > 2 - n*globus-url-copy - With 32 parallel transfers it starts failing and > gets about 10KB/s > > 3 - globus-url-copy with a list of files: around 300KB/s > > 4 - globus-url-copy with a list of files, E mode, and data channel > re-use: 500KB/s > > So I figured I should hack the GridFTP provider to re-use data channels > by default. This is where it gets strange. I get averages (over multiple > runs) of over 1MB/s, with mins of about 130KB and max of 1.9MB/s, but > with a lot of variability. I'll debug this. However, I think there is > still value in enabling this by default. > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- Ian Foster, Director, Computation Institute Argonne National Laboratory & University of Chicago Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439 Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637 Tel: +1 630 252 4619. Web: www.ci.uchicago.edu. Globus Alliance: www.globus.org. From hategan at mcs.anl.gov Wed Nov 28 18:31:58 2007 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 28 Nov 2007 18:31:58 -0600 Subject: [Swift-devel] transfers of small files In-Reply-To: <474E06D6.9030601@mcs.anl.gov> References: <1196295618.29963.10.camel@blabla.mcs.anl.gov> <474E06D6.9030601@mcs.anl.gov> Message-ID: <1196296318.29963.19.camel@blabla.mcs.anl.gov> On Wed, 2007-11-28 at 18:24 -0600, Ian Foster wrote: > Mihael: > > It isn't clear to me--are you using the "lots of small files" > optimization here? It depends what you mean by "lots of small files optimization". Obviously this is an optimization for the lots of small files case. I'm re-using clients with mode E and only sending PASV once per client. 
Let's call this A. There was word of "pipelining". We'll call that B. I assume it to be different from what I did (A) for the following reasons: 1. Jarek had tests for A in JGlobus, so A is not a new deal. 2. Buzz recently committed some code to JGlobus to enable B, which assumes B was not possible before, therefore B != A. > > I've CCed John Bresnahan so he can comment. > > Ian. > > Mihael Hategan wrote: > > So I've been playing with that issue. I've made some measurements > > outside Swift. Here's a summary: > > > > 32k files. From terminable to tg-uc > > > > 1 - karajan with connection caching. transfers in parallel. tops at > > 200KB/s > > > > 2 - n*globus-url-copy - With 32 parallel transfers it starts failing and > > gets about 10KB/s > > > > 3 - globus-url-copy with a list of files: around 300KB/s > > > > 4 - globus-url-copy with a list of files, E mode, and data channel > > re-use: 500KB/s > > > > So I figured I should hack the GridFTP provider to re-use data channels > > by default. This is where it gets strange. I get averages (over multiple > > runs) of over 1MB/s, with mins of about 130KB and max of 1.9MB/s, but > > with a lot of variability. I'll debug this. However, I think there is > > still value in enabling this by default. > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From itf at mcs.anl.gov Wed Nov 28 18:54:50 2007 From: itf at mcs.anl.gov (=?utf-8?B?SWFuIEZvc3Rlcg==?=) Date: Thu, 29 Nov 2007 00:54:50 +0000 Subject: [Swift-devel] transfers of small files In-Reply-To: <1196296318.29963.19.camel@blabla.mcs.anl.gov> References: <1196295618.29963.10.camel@blabla.mcs.anl.gov> <474E06D6.9030601@mcs.anl.gov><1196296318.29963.19.camel@blabla.mcs.anl.gov> Message-ID: <794941865-1196297704-cardhu_decombobulator_blackberry.rim.net-1612396432-@bxe017.bisx.prod.on.blackberry> As mentioned in an email from a few weeks ago, the gridftp guys have implemented support for streaming many small files. I would hope we would try that before implementing our own version. Ian Sent via BlackBerry from T-Mobile -----Original Message----- From: Mihael Hategan Date: Wed, 28 Nov 2007 18:31:58 To:Ian Foster Cc:swift-devel , John Bresnahan Subject: Re: [Swift-devel] transfers of small files On Wed, 2007-11-28 at 18:24 -0600, Ian Foster wrote: > Mihael: > > It isn't clear to me--are you using the "lots of small files" > optimization here? It depends what you mean by "lots of small files optimization". Obviously this is an optimization for the lots of small files case. I'm re-using clients with mode E and only sending PASV once per client. Let's call this A. There was word of "pipelining". We'll call that B. I assume it to be different from what I did (A) for the following reasons: 1. Jarek had tests for A in JGlobus, so A is not a new deal. 2. Buzz recently committed some code to JGlobus to enable B, which assumes B was not possible before, therefore B != A. > > I've CCed John Bresnahan so he can comment. > > Ian. > > Mihael Hategan wrote: > > So I've been playing with that issue. I've made some measurements > > outside Swift. Here's a summary: > > > > 32k files. From terminable to tg-uc > > > > 1 - karajan with connection caching. transfers in parallel. 
tops at > > 200KB/s > > > > 2 - n*globus-url-copy - With 32 parallel transfers it starts failing and > > gets about 10KB/s > > > > 3 - globus-url-copy with a list of files: around 300KB/s > > > > 4 - globus-url-copy with a list of files, E mode, and data channel > > re-use: 500KB/s > > > > So I figured I should hack the GridFTP provider to re-use data channels > > by default. This is where it gets strange. I get averages (over multiple > > runs) of over 1MB/s, with mins of about 130KB and max of 1.9MB/s, but > > with a lot of variability. I'll debug this. However, I think there is > > still value in enabling this by default. > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > From bugzilla-daemon at mcs.anl.gov Fri Nov 30 14:03:07 2007 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Fri, 30 Nov 2007 14:03:07 -0600 (CST) Subject: [Swift-devel] [Bug 114] New: need to specify run directory name on remote site Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=114 Summary: need to specify run directory name on remote site Product: Swift Version: unspecified Platform: PC OS/Version: Linux Status: NEW Severity: major Priority: P2 Component: SwiftScript language AssignedTo: benc at hawaga.org.uk ReportedBy: skenny at uchicago.edu currently the run directory on the remote site is auto-generated by swift. it is important to be able to specify the directory name, especially if it will be used with a portal and/or a community cert so that directory names can include user name. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
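One low-risk way to address this would be to keep the generated timestamp and random suffix (which keep concurrent runs apart) but allow a user-supplied prefix, so a portal running under a community cert can fold the mapped user name into the directory. A sketch of such name generation, illustrative only and not part of Swift:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Random;

    // Builds names like "skenny-myworkflow-20071130-1403-x7k2pq9a":
    // optional user prefix, workflow name, timestamp, random suffix.
    class RunDirNames {
        private static final String ALPHABET =
                "0123456789abcdefghijklmnopqrstuvwxyz";
        private static final Random RANDOM = new Random();

        static String generate(String userPrefix, String workflowName) {
            String stamp =
                    new SimpleDateFormat("yyyyMMdd-HHmm").format(new Date());
            StringBuilder suffix = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                suffix.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
            }
            String base = workflowName + "-" + stamp + "-" + suffix;
            return (userPrefix == null || userPrefix.length() == 0)
                    ? base
                    : userPrefix + "-" + base;
        }
    }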