From benc at hawaga.org.uk Sun Mar 2 09:25:30 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 2 Mar 2008 15:25:30 +0000 (GMT) Subject: [Swift-devel] swift 0.4-rc1 Message-ID: I've just made release candidate 1 for swift 0.4. It passes my superficial testing. Please download and test it, ideally with your own big applications. If there are no major problems, this will be released as swift 0.4 sometime Tuesday. http://www.ci.uchicago.edu/~benc/vdsk-0.4-rc1.tar.gz $ md5sum /home/benc/public_html/vdsk-0.4-rc1.tar.gz 90dfd5a91f27a0aea2c0cd56642e7721 /home/benc/public_html/vdsk-0.4-rc1.tar.gz It is Swift SVN r1696 and CoG SVN r1933 From benc at hawaga.org.uk Sun Mar 2 10:56:25 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 2 Mar 2008 16:56:25 +0000 (GMT) Subject: [Swift-devel] stageout: Expected multiline reply Message-ID: Running from terminable to tg-uc, I get an error at stageout, starting with the below: Caused by: org.globus.ftp.exception.ServerException: Custom message: Could not create MlsxEntry [Nested exception message: Custom message: Expected multiline reply] [Nested exception is org.globus.ftp.exception.FTPException: Custom messa ge: Expected multiline reply] at org.globus.ftp.FTPClient.mlst(FTPClient.java:642) at org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl.isDirec tory(FileResourceImpl.java:159) The complete log file is http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080302-1031-vdualzq0.log This is with 0.4rc1. This is using gram2 to submit to PBS. Other runs from my testing appeared to have worked ok submitting to fork. I'll dig a bit deeper, but I'm not sure what the likely meaning of the above error is. -- From hategan at mcs.anl.gov Sun Mar 2 14:06:24 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 02 Mar 2008 14:06:24 -0600 Subject: [Swift-devel] stageout: Expected multiline reply In-Reply-To: References: Message-ID: <1204488384.7286.0.camel@blabla.mcs.anl.gov> Repeatable or not? 
Consistent or not? On Sun, 2008-03-02 at 16:56 +0000, Ben Clifford wrote: > Running from terminable to tg-uc, I get an error at stageout, starting > with the below: > > Caused by: org.globus.ftp.exception.ServerException: Custom message: > Could not > create MlsxEntry [Nested exception message: Custom message: Expected > multiline > reply] [Nested exception is org.globus.ftp.exception.FTPException: Custom > messa > ge: Expected multiline reply] > at org.globus.ftp.FTPClient.mlst(FTPClient.java:642) > at > org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl.isDirec > tory(FileResourceImpl.java:159) > > > The complete log file is > http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080302-1031-vdualzq0.log > > This is with 0.4rc1. This is using gram2 to submit to PBS. Other runs from > my testing appeared to have worked ok submitting to fork. I'll dig a bit > deeper, but I'm not sure what the likely meaning of the above error is. > From benc at hawaga.org.uk Sun Mar 2 14:08:41 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sun, 2 Mar 2008 20:08:41 +0000 (GMT) Subject: [Swift-devel] stageout: Expected multiline reply In-Reply-To: <1204488384.7286.0.camel@blabla.mcs.anl.gov> References: <1204488384.7286.0.camel@blabla.mcs.anl.gov> Message-ID: at least today its repeatable and consistent against TGUC using pbs+gram2 but doesn't happen any time I use fork+gram2. some other sites that I've run it against haven't had this problem. I rarely run stuff from terminable so I have no idea if this has been the case for some time or is for one day only. On Sun, 2 Mar 2008, Mihael Hategan wrote: > Repeatable or not? Consistent or not? 
> > On Sun, 2008-03-02 at 16:56 +0000, Ben Clifford wrote: > > Running from terminable to tg-uc, I get an error at stageout, starting > > with the below: > > > > Caused by: org.globus.ftp.exception.ServerException: Custom message: > > Could not > > create MlsxEntry [Nested exception message: Custom message: Expected > > multiline > > reply] [Nested exception is org.globus.ftp.exception.FTPException: Custom > > messa > > ge: Expected multiline reply] > > at org.globus.ftp.FTPClient.mlst(FTPClient.java:642) > > at > > org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl.isDirec > > tory(FileResourceImpl.java:159) > > > > > > The complete log file is > > http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080302-1031-vdualzq0.log > > > > This is with 0.4rc1. This is using gram2 to submit to PBS. Other runs from > > my testing appeared to have worked ok submitting to fork. I'll dig a bit > > deeper, but I'm not sure what the likely meaning of the above error is. > > > > From hategan at mcs.anl.gov Sun Mar 2 14:13:10 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sun, 02 Mar 2008 14:13:10 -0600 Subject: [Swift-devel] stageout: Expected multiline reply In-Reply-To: References: <1204488384.7286.0.camel@blabla.mcs.anl.gov> Message-ID: <1204488790.7286.3.camel@blabla.mcs.anl.gov> So we may either be dealing with a gridftp incompatibility or some strange cog thing. I'll have to debug to be able to tell. On Sun, 2008-03-02 at 20:08 +0000, Ben Clifford wrote: > at least today its repeatable and consistent against TGUC using pbs+gram2 > but doesn't happen any time I use fork+gram2. > > some other sites that I've run it against haven't had this problem. > > I rarely run stuff from terminable so I have no idea if this has been the > case for some time or is for one day only. > > On Sun, 2 Mar 2008, Mihael Hategan wrote: > > > Repeatable or not? Consistent or not? 
> > > > On Sun, 2008-03-02 at 16:56 +0000, Ben Clifford wrote: > > > Running from terminable to tg-uc, I get an error at stageout, starting > > > with the below: > > > > > > Caused by: org.globus.ftp.exception.ServerException: Custom message: > > > Could not > > > create MlsxEntry [Nested exception message: Custom message: Expected > > > multiline > > > reply] [Nested exception is org.globus.ftp.exception.FTPException: Custom > > > messa > > > ge: Expected multiline reply] > > > at org.globus.ftp.FTPClient.mlst(FTPClient.java:642) > > > at > > > org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl.isDirec > > > tory(FileResourceImpl.java:159) > > > > > > > > > The complete log file is > > > http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080302-1031-vdualzq0.log > > > > > > This is with 0.4rc1. This is using gram2 to submit to PBS. Other runs from > > > my testing appeared to have worked ok submitting to fork. I'll dig a bit > > > deeper, but I'm not sure what the likely meaning of the above error is. > > > > > > > > From benc at hawaga.org.uk Sun Mar 2 22:33:51 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 3 Mar 2008 04:33:51 +0000 (GMT) Subject: [Swift-devel] Re: stageout: Expected multiline reply In-Reply-To: References: Message-ID: this doesn't happen now i try it again a few hours later. grr. -- From hategan at mcs.anl.gov Mon Mar 3 05:01:31 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 03 Mar 2008 05:01:31 -0600 Subject: [Swift-devel] Re: stageout: Expected multiline reply In-Reply-To: References: Message-ID: <1204542091.9797.1.camel@blabla.mcs.anl.gov> Maybe it was a glitch in the gridftp server. The error message sounds like it was (i.e. a protocol problem). On Mon, 2008-03-03 at 04:33 +0000, Ben Clifford wrote: > this doesn't happen now i try it again a few hours later. grr. 
From benc at hawaga.org.uk Tue Mar 4 17:33:18 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 4 Mar 2008 23:33:18 +0000 (GMT) Subject: [Swift-devel] Swift log processing code Message-ID: Over the past few months, I've been developing log processing and analysis code that takes in various log files from Swift runs and uses them to make various plots. I have written a small note on how to download and use these tools, so that others can experiment with them: http://www.ci.uchicago.edu/swift/guides/log-processing.php Much of the output is rather rough and poorly documented, however I'm quite happy to explain stuff on these lists if/when people have questions. -- From benc at hawaga.org.uk Wed Mar 5 08:01:49 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 5 Mar 2008 14:01:49 +0000 (GMT) Subject: [Swift-devel] Re: stageout: Expected multiline reply In-Reply-To: References: Message-ID: On Mon, 3 Mar 2008, Ben Clifford wrote: > this doesn't happen now i try it again a few hours later. grr. actually it does still happen most of the time. from terminable to tg-uc using gram2. all the other sites in my testing appear to work ok (those being the sites that I just committed to tests/sites/ in the SVN) - including tguc + gram4 + pbs and teraport + gram2 + pbs. This happens both with 0.4rc1 and yesterday's HEADs -- From hategan at mcs.anl.gov Wed Mar 5 08:13:49 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 05 Mar 2008 08:13:49 -0600 Subject: [Swift-devel] Re: stageout: Expected multiline reply In-Reply-To: References: Message-ID: <1204726429.4180.46.camel@blabla.mcs.anl.gov> Ok, can you enable debug on org.globus.ftp? On Wed, 2008-03-05 at 14:01 +0000, Ben Clifford wrote: > On Mon, 3 Mar 2008, Ben Clifford wrote: > > > this doesn't happen now i try it again a few hours later. grr. > > actually it does still happen most of the time. from terminable to tg-uc > using gram2. 
> all the other sites in my testing appear to work ok (those being the sites > that I just committed to tests/sites/ in the SVN) - including tguc + gram4 > + pbs and teraport + gram2 + pbs. > > This happens both with 0.4rc1 and yesterday's HEADs > > > From benc at hawaga.org.uk Wed Mar 5 08:42:00 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 5 Mar 2008 14:42:00 +0000 (GMT) Subject: [Swift-devel] Re: stageout: Expected multiline reply In-Reply-To: <1204726429.4180.46.camel@blabla.mcs.anl.gov> References: <1204726429.4180.46.camel@blabla.mcs.anl.gov> Message-ID: On Wed, 5 Mar 2008, Mihael Hategan wrote: > Ok, can you enable debug on org.globus.ftp? http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080305-0819-28le50uc.log This is running 061-cattwo in tests/language-behaviour/ -- From hategan at mcs.anl.gov Thu Mar 6 05:42:44 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 06 Mar 2008 05:42:44 -0600 Subject: [Swift-devel] Re: stageout: Expected multiline reply In-Reply-To: References: <1204726429.4180.46.camel@blabla.mcs.anl.gov> Message-ID: <1204803764.5265.2.camel@blabla.mcs.anl.gov> There's quite a bit of weirdness there. I'll have to try to reproduce it and dig deeper. On Wed, 2008-03-05 at 14:42 +0000, Ben Clifford wrote: > On Wed, 5 Mar 2008, Mihael Hategan wrote: > > > Ok, can you enable debug on org.globus.ftp? 
> > http://www.ci.uchicago.edu/~benc/tmp/061-cattwo-20080305-0819-28le50uc.log > > This is running 061-cattwo in tests/language-behaviour/ > From benc at hawaga.org.uk Thu Mar 6 11:26:02 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 6 Mar 2008 17:26:02 +0000 (GMT) Subject: [Swift-devel] Failed null Message-ID: I'm seeing plenty of errors during stagein that look like this: 2008-03-06 10:42:02,911-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-624-4-1204821602563) setting status to Submitting 2008-03-06 10:42:02,911-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-624-4-1204821602563) setting status to Submitted 2008-03-06 10:42:02,912-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-624-4-1204821602563) setting status to Active 2008-03-06 10:42:02,937-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=urn:0-624-4-1204821602563) setting status to Failed null 'null' is not so helpful - also there's nothing indicating which file was being transferred here... 
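[Editor's note: the org.globus.ftp debug output requested earlier in this thread is typically enabled through log4j. This is a sketch only: the logger package name comes from the stack traces above and uses standard log4j property syntax, but where a given Swift install reads its log4j.properties from is an assumption, not confirmed in the thread.]

```properties
# Raise the CoG GridFTP client package to DEBUG so protocol
# traffic (PASV replies, MLST exchanges, etc.) appears in the log.
# The location of this properties file within a Swift install may vary.
log4j.logger.org.globus.ftp=DEBUG
```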
-- From benc at hawaga.org.uk Thu Mar 6 12:50:50 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 6 Mar 2008 18:50:50 +0000 (GMT) Subject: [Swift-devel] Re: Failed null In-Reply-To: References: Message-ID: I got mike kubal to rerun with org.globus.ftp debugging on and I see this: 2008-03-06 12:20:15,173-0600 DEBUG FTPControlChannel Control channel sending: PA SV 2008-03-06 12:20:15,174-0600 DEBUG Reply read 1st line 2008-03-06 12:20:15,180-0600 DEBUG Reply 1st line: 227 Entering Passive Mode (19 2,5,198,208,195,86) 2008-03-06 12:20:15,180-0600 DEBUG FTPControlChannel Control channel received: 2 27 Entering Passive Mode (192,5,198,208,195,86) 2008-03-06 12:20:15,180-0600 DEBUG GridFTPServerFacade hostport: 192.5.198.208 5 0006 2008-03-06 12:20:15,180-0600 DEBUG TransferThreadManager adding new empty socket Box to the socket pool 2008-03-06 12:20:15,180-0600 DEBUG SocketPool adding a free socket 2008-03-06 12:20:15,180-0600 DEBUG TransferThreadManager connecting active socke t 0; total cached sockets = 1 2008-03-06 12:20:15,185-0600 DEBUG TaskThread executing task: org.globus.ftp.dc. GridFTPActiveConnectTask at 851105 2008-03-06 12:20:15,186-0600 DEBUG GridFTPActiveConnectTask connecting new socke t to: 192.5.198.208 50006 2008-03-06 12:20:15,188-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, identity=ur n:0-114-5-1204827503989) setting status to Failed null not sure what's happening after 'connecting new socket' that makes the provider decide its failed... 
-- From benc at hawaga.org.uk Thu Mar 6 13:05:13 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 6 Mar 2008 19:05:13 +0000 (GMT) Subject: [Swift-devel] hanging without allocating a host Message-ID: In the last couple of runs that mike kubal has made, with a large number of file transfer failures, the workflow eventually hangs with no karajan tasks in progress, and (according to a debug line I just added) apparently waiting for a host to be allocated - this will, I guess, never happen as nothing is happening to change the host scores. bleugh. -- From benc at hawaga.org.uk Thu Mar 6 13:07:03 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 6 Mar 2008 19:07:03 +0000 (GMT) Subject: [Swift-devel] Re: hanging without allocating a host In-Reply-To: References: Message-ID: the most recent log for this is wiggum:/nfs/dsl-homes03/mkubal/IMPDH/Swift_MD_Runs/Fixed_Ligands/ligands/run_MD_pipeline_loop_for_impdh-20080306-1218-g6aci607.lo it has ftp logging turned on and shows the FTP problem, and also you can see it stop at 13:01 after a successful job completion (one of about 5 for the whole run) and hang for a long time. -- From benc at hawaga.org.uk Thu Mar 6 18:10:17 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 7 Mar 2008 00:10:17 +0000 (GMT) Subject: [Swift-devel] Re: swift 0.4-rc1 In-Reply-To: References: Message-ID: On Sun, 2 Mar 2008, Ben Clifford wrote: > If there are no major problems, this will be released as swift 0.4 > sometime Tuesday. The gridftp problems of the past few days are giving me bad vibes so I'm going to wait a while. 
-- From iraicu at cs.uchicago.edu Fri Mar 7 00:03:04 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 07 Mar 2008 00:03:04 -0600 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47BF3593.7030600@mcs.anl.gov> <47BF36EE.90504@cs.uchicago.edu> <47BF377A.6020707@uchicago.edu> <47BF38C8.10507@cs.uchicago.edu> <47BF4D3C.7030100@uchicago.edu> <47BF4FCD.9070904@cs.uchicago.edu> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> Message-ID: <47D0DA98.6010308@cs.uchicago.edu> Ben Clifford wrote: > you should send questions like this to swift-devel or swift-user list > rather than attempting to compose your own list of likely candidates and > witholding the information from the public archives. > Made the reply to the Swift devel mailing lists... > >> I am trying to dig into the wrapper.sh, disable the log to enhance the >> performance. >> > > Do you have numbers that suggest logging is causing a performance > degradation? By default, Swift is able to do about 5 jobs/sec running over Falkon on 256 CPUs on the BG/P, where each job is a sleep 0. The Falkon command line client can do about 1700 jobs/sec on the same hardware. 9 months ago, I saw Swift go from a few jobs/sec to about 50 jobs/sec by stripping out all logging (i.e. echo "..." >> LOG) from the wrapper script, and by removing the mkdir and symbolic linking. 
Since the mkdir is much improved now, I assume that is not the bottleneck, but doing 10~20 echo to a log file on the shared file system from many nodes at the same time is expensive, which I think is the main bottleneck in the current wrapper script. Once Zhao is done disabling all logging, except for necessary ones, we'll have a better idea of how fast we can go, and if it is necessary to eliminate the mkdir step as well. I think getting about 50 jobs/sec is within reach by streamlining the wrapper.sh script, but I think we'll have to think of ways to push those numbers even higher! > I notice you're using quite an old version of swift iraicu at login1.surveyor:/home/zzhang/cog/modules/vdsk> svn info Path: . URL: https://svn.ci.uchicago.edu/svn/vdl2/trunk Repository Root: https://svn.ci.uchicago.edu/svn/vdl2 Repository UUID: e2bb083e-7f23-0410-b3a8-8253ac9ef6d8 Revision: 1673 Node Kind: directory Schedule: normal Last Changed Author: benc at CI.UCHICAGO.EDU Last Changed Rev: 1670 Last Changed Date: 2008-02-09 12:42:56 -0600 (Sat, 09 Feb 2008) It doesn't seem that old, but we'll update to the latest one before we do more experiments. > (the last > release) - we made substantial log speed improvements subsequent to that. > If you're hitting log file problems here, there is a fair chance that > you'll encounter other scalability problems on the site filesystem that we > also fixed in SVN some months ago. > Right, I know, and I thought we were using a late enough version that had those fixes. Just to be sure, we'll upgrade! > >> One thing I notice is that for each job, correct me if I am wrong, SWIFT >> will make a unique directory with the date and a random string, then >> copy wrapper.sh and other necessary files to that directory. >> > > It should do that one per workflow per site, not per job. > Every job still has a scratch space sandbox, which results in a mkdir, symbolic linking, and finally a cleanup remove dir. I think this is the dir he is referring to. 
BTW, if there would be an easy way to eliminate this entire mkdir part of the wrapper script without breaking anything in Swift, it would be nice. The apps we are dealing with don't need the sandboxing, as we know all input files, and all output files, and we'll never have input as *.fits that might be ambiguous if we don't sandbox. Ioan > >> echo -abc "Hello, world!" stdout=@filename(t); >> > > Put the -abc in quotes: > > echo "-abc" "hello" > > to solve the immediate problem. > > However, note that the command: > echo -abc hello > executes successfully on my linux and os x boxes. > > If you want a job that will fail, try the 'false' command. > > >> RunID: 20080306-1647-4nd1cymf >> Execution failed: >> Variable not found: abc >> > > This is because you did not quote "-abc", so swift is trying to give you > the unary negative value of -abc (just like if you said -abc in Java or > C). > > >> But I still can not find the default working directory of this task. >> Also, I know there is a log file for this wrapper, so it is in the >> working directory, right? >> > > Swift will never have attempted to run the above, because of the above > error. > > >> Another question is, could you give me a simple task description of >> wrapper.sh? So I could invoke wrapper.sh directly without falkon. I got >> a task description before, >> >> 140.221.82.10 : urn:0-195-1203621652641 : EXECUTABLE /bin/bash ARGUEMENTS >> shared/wrapper.sh sleep-1j38kqoi -jobdir 1 -e /bin/sleep -out stdout.txt -err >> stderr.txt -i -d -if -of -k -a 0 >> >> but it is within the working directory, and I don't understand what >> "sleep-1j38kqoi" means. >> > > sleep-1j38kqoi is a job identifier (in Swift internal language, an > execute2 identifier, perhaps) which identifies one attempt to run an > application. This is used to label log files and working directories for > this. > > -- =================================================== Ioan Raicu Ph.D. 
Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Fri Mar 7 02:36:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 7 Mar 2008 08:36:40 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D0DA98.6010308@cs.uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47BF38C8.10507@cs.uchicago.edu> <47BF4D3C.7030100@uchicago.edu> <47BF4FCD.9070904@cs.uchicago.edu> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> Message-ID: On Fri, 7 Mar 2008, Ioan Raicu wrote: > > I notice you're using quite an old version of swift > iraicu at login1.surveyor:/home/zzhang/cog/modules/vdsk> svn info > Revision: 1673 Zhao's log reported this: zhaozhang at viper:~/vdsk-0.3/examples/vdsk> swift first.swift Swift v0.3 r1319 (modified locally) which is nowhere near the version that you report - something in the 1600s is a 
reasonable number, but that's not what the log output was. So please clarify which version you actually have poor behaviour for. If it is in the r1600 range, then I'm interested to fix up - but in the r1600s there should be absolutely no cross-node shared log files at all. > Every job still has a scratch space sandbox, which results in a mkdir, > symbolic linking, and finally a cleanup remove dir. I think this is the > dir he is referring to. BTW, if there would be an easy way to eliminate > this entire mkdir part of the wrapper script without breaking anything > in Swift, it would be nice. The apps we are dealing with don't need the > sandboxing, as we know all input files, and all output files, and we'll > never have input as *.fits that might be ambiguous if we don't sandbox. If the jobs touch only those files, I think you can probably eliminate the mkdir, the ln to copy files in and the cp to copy files back out, and run the job in the shared directory directly. However, as each node will then be trying to do stuff to that shared directory, my initial thoughts would be that it wouldn't really change much (perhaps better, perhaps worse). 
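[Editor's note: the sandbox-versus-shared-directory trade-off discussed above can be sketched as plain shell. This is illustrative only, not Swift's actual wrapper code; the directory names and the `tr` stand-in for the application are assumptions.]

```shell
#!/bin/sh
# Illustrative names only (not Swift's real layout).
SHARED="shared"
JOBDIR="$SHARED/jobs/job-123"
mkdir -p "$SHARED"
echo data > "$SHARED/input.txt"

# Sandboxed style: mkdir a per-job dir, link inputs in,
# run there, copy outputs back out, then clean up.
mkdir -p "$JOBDIR"
ln -s "$(pwd)/$SHARED/input.txt" "$JOBDIR/input.txt"
( cd "$JOBDIR" && tr a-z A-Z < input.txt > output.txt )
cp "$JOBDIR/output.txt" "$SHARED/"
rm -rf "$JOBDIR"

# Direct style: when all inputs and outputs are known in advance,
# skip the sandbox and run in the shared directory itself -- at the
# cost of every node touching that one directory concurrently.
( cd "$SHARED" && tr a-z A-Z < input.txt > output2.txt )
```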
-- From benc at hawaga.org.uk Fri Mar 7 02:43:48 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 7 Mar 2008 08:43:48 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D0DA98.6010308@cs.uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47BF38C8.10507@cs.uchicago.edu> <47BF4D3C.7030100@uchicago.edu> <47BF4FCD.9070904@cs.uchicago.edu> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> Message-ID: On Fri, 7 Mar 2008, Ioan Raicu wrote: > symbolic linking. Since the mkdir is much improved now, I assume that is not > the bottleneck, but doing 10~20 echo to a log file on the shared file system > from many nodes at the same time is expensive, which I think is the main > bottleneck in the current wrapper script. Once Zhao is done disabling all > logging, except for necessary ones, we'll have a better idea of how fast we > can go, and if it is necessary to eliminate the mkdir step as well. When I was playing with this around the time of SC, I put in a bunch of progress logging inside the wrapper script. This adds to the amount of logging that the wrapper does, but gives a many stage breakdown of where the wrapper script is spending its time. Run a bunch of jobs, eg a few thousand, with latest SVN and wrapperlog.always.transfer=true set in swift.properties. You'll get a .d directory, with a bunch of .info files. 
From there I (or you) can graph how each wrapper script spent its time. Ideally there should be a bunch of steps taking almost no time, then the executable, then another bunch of steps taking almost no time; but doing this should reveal wrong behaviour there. Poke me when you have that dump directory and I can have a look. -- From iraicu at cs.uchicago.edu Fri Mar 7 07:40:08 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 07 Mar 2008 07:40:08 -0600 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47BF38C8.10507@cs.uchicago.edu> <47BF4D3C.7030100@uchicago.edu> <47BF4FCD.9070904@cs.uchicago.edu> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> Message-ID: <47D145B8.2030804@cs.uchicago.edu> Ben Clifford wrote: > On Fri, 7 Mar 2008, Ioan Raicu wrote: > > > >>> I notice you're using quite an old version of swift >>> >> iraicu at login1.surveyor:/home/zzhang/cog/modules/vdsk> svn info >> > > >> Revision: 1673 >> > > Zhao's log reported this: > > zhaozhang at viper:~/vdsk-0.3/examples/vdsk> swift first.swift > Swift v0.3 r1319 (modified locally) > Notice that this Swift is from viper, a machine in RI. He might be playing with this to get used to learning about Swift, but that is not being used on the BG/P in any way, as everything we run on the BG/P needs to run from the login nodes. 
Zhao, can you confirm that the runs you are making with Swift on the BG/P are indeed using a recent version? Can you also do an update of Swift on the BG/P to make sure we have the latest? > which is nowhere near the version that you report - something in the 1600s > is a reasonable number, but that's not what the log output was. So please > clarify which version you actually have poor behaviour for. If it is in > the r1600 range, then I'm interested to fix up - but in the r1600s there > should be absolutely no cross-node shared log files at all. > > >> Every job still has a scratch space sandbox, which results in a mkdir, >> symbolic linking, and finally a cleanup remove dir. I think this is the >> dir he is referring to. BTW, if there would be an easy way to eliminate >> this entire mkdir part of the wrapper script without breaking anything >> in Swift, it would be nice. The apps we are dealing with don't need the >> sandboxing, as we know all input files, and all output files, and we'll >> never have input as *.fits that might be ambiguous if we don't sandbox. >> > > If the jobs touch only those files, I think you can probably eliminate the > mkdir, the ln to copy files in and the cp to copy files back out, and run > the job in the shared directory directly. However, as each node will then > be trying to do stuff to that shared directly, my initial thoughts would > be that it wouldn't really change much (perhaps better, perhaps worse). > That is what I thought. We'll try that and see what we get! We'll keep you posted. Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From iraicu at cs.uchicago.edu Fri Mar 7 07:42:40 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 07 Mar 2008 07:42:40 -0600 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47BF38C8.10507@cs.uchicago.edu> <47BF4D3C.7030100@uchicago.edu> <47BF4FCD.9070904@cs.uchicago.edu> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> Message-ID: <47D14650.2000206@cs.uchicago.edu> Ben Clifford wrote: > On Fri, 7 Mar 2008, Ioan Raicu wrote: > > >> symbolic linking. Since the mkdir is much improved now, I assume that is not >> the bottleneck, but doing 10~20 echo to a log file on the shared file system >> from many nodes at the same time is expensive, which I think is the main >> bottleneck in the current wrapper script. Once Zhao is done disabling all >> logging, except for necessary ones, we'll have a better idea of how fast we >> can go, and if it is necessary to eliminate the mkdir step as well. 
>> > > When I was playing with this around the time of SC, I put in a bunch of > progress logging inside the wrapper script. This adds to the amount of > logging that the wrapper does, but gives a many stage breakdown of where > the wrapper script is spending its time. > > Run a bunch of jobs, eg a few thousand, with latest SVN and > wrapperlog.always.transfer=true set in swift.properties. > > You'll get a .d directory, with a bunch of .info files. From there > I (or you) can graph how each wrapper script spent its time. > > Ideally there should be a bunch of steps taking almost no time, then the > executable, then another bunch of steps taking almost no time; but doing > this should reveal wrong behaviour there. > > Poke me when you have that dump directory and I can have a look. > Ideally, we'd want any extra logging outside of the bare minimum to be optional, something that could be turned on or off depending on output level. Maybe you or Mihael could work in such an option in the future, so we could easily disable all logging in the wrapper script if we need to. In the meantime, we'll hack away to it ourselves :) We'll try to do some back to back comparison runs, and save the logs, and let you know where they are for later debugging. Thanks, Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benc at hawaga.org.uk Fri Mar 7 09:51:55 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 7 Mar 2008 15:51:55 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D14650.2000206@cs.uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> Message-ID: On Fri, 7 Mar 2008, Ioan Raicu wrote: > Ideally, we'd want any extra logging outside of the bare minimum to be > optional, something that could be turned on or off depending on output level. > Maybe you or Mihael could work in such an option in the future, so we could > easily disable all logging in the wrapper script if we need to. In the You can perhaps convince me easily (or unconvince me) if you provide the -info files for a large test run. I implemented measurements there in november to provide quantified results for almost exactly this situation. You will considerably enhance this discourse by providing that information now rather than later. 
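The per-stage breakdown Ben describes can be summarized with a small script once the run's .d directory of -info files is in hand. This is only a sketch: the exact layout of the *-info files is not shown in this thread, so the two-column "<epoch-seconds> <stage-name>" line format below is a hypothetical stand-in, as is the `summarize_info` name; adapt the awk to whatever the wrapper actually writes.

```shell
# Print how long each wrapper stage took, given an -info file whose lines
# are assumed (hypothetically) to look like: "<epoch-seconds> <stage-name>".
summarize_info() {
    awk '
        NR > 1 { printf "stage=%s duration=%ds\n", prevstage, $1 - prev }
        { prev = $1; prevstage = $2 }
    ' "$1"
}

# Synthetic demo file standing in for a real jobid-info file:
cat > /tmp/demo-info <<'EOF'
1204888800 LOG_START
1204888801 CREATE_JOBDIR
1204888804 EXECUTE
1204888804 STAGE_OUT
EOF
summarize_info /tmp/demo-info
# Stages that take almost no time show duration=0s; a slow shared
# filesystem would show up as a large gap around the mkdir/logging steps.
```

Ideally most stages print duration=0s with the executable itself accounting for nearly all of the elapsed time; anything else points at wrapper overhead.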
-- From benc at hawaga.org.uk Fri Mar 7 10:41:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 7 Mar 2008 16:41:07 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D16F6D.9070903@uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> Message-ID: On Fri, 7 Mar 2008, Zhao Zhang wrote: > Where do those -info files live, if I run SWIFT locally on a linux box? You will need to use a recent (as in r1715 swift, r1934 cog) build from SVN. 
Edit your swift.properties file so that this setting:

wrapperlog.always.transfer=false

is changed to this:

wrapperlog.always.transfer=true

Then when you run, you will get a log file called:

whatever-20080101-9999-abcdef.log

and a corresponding directory,

whatever-20080101-9999-abcdef.d/

In that directory, you should see one *-info file for each job that is run
(with the * being the jobid, the same as passed on the wrapper.sh command
line that we talked about 12h ago)

--

From zhaozhang at uchicago.edu Fri Mar 7 10:38:05 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 07 Mar 2008 10:38:05 -0600
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu>
Message-ID: <47D16F6D.9070903@uchicago.edu>

Hi, Ben

Where do those -info files live, if I run SWIFT locally on a linux box?

zhao

Ben Clifford wrote:
> On Fri, 7 Mar 2008, Ioan Raicu wrote:
>
>> Ideally, we'd want any extra logging outside of the bare minimum to be
>> optional, something that could be turned on or off depending on output level.
>> Maybe you or Mihael could work in such an option in the future, so we could
>> easily disable all logging in the wrapper script if we need to. In the
>
> You can perhaps convince me easily (or unconvince me) if you provide the
> -info files for a large test run.
I implemented measurements there in > november to provide quantified results for almost exactly this situation. > You will considerably enhance this discourse by providing that information > now rather than later. > > From iraicu at cs.uchicago.edu Fri Mar 7 11:16:29 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 07 Mar 2008 11:16:29 -0600 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47C13DEC.4090008@uchicago.edu> <47C1C5CD.5080906@cs.uchicago.edu> <47C1C8C6.1020009@cs.uchicago.edu> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> Message-ID: <47D1786D.4010609@cs.uchicago.edu> OK, Zhao is working on it, and should get you those logs later today. Ioan Ben Clifford wrote: > On Fri, 7 Mar 2008, Ioan Raicu wrote: > > >> Ideally, we'd want any extra logging outside of the bare minimum to be >> optional, something that could be turned on or off depending on output level. >> Maybe you or Mihael could work in such an option in the future, so we could >> easily disable all logging in the wrapper script if we need to. In the >> > > You can perhaps convince me easily (or unconvince me) if you provide the > -info files for a large test run. I implemented measurements there in > november to provide quantified results for almost exactly this situation. > You will considerably enhance this discourse by providing that information > now rather than later. 
> >

From zhaozhang at uchicago.edu Fri Mar 7 18:18:16 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 07 Mar 2008 18:18:16 -0600
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C2E11F.2070208@mcs.anl.gov> <47C2E483.70005@cs.uchicago.edu> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu>
Message-ID: <47D1DB48.4080709@uchicago.edu>

Hi,

I got this problem when I tried to connect Swift and Falkon. I could run
this before with the r1673 version of Swift, but it does not go through
with the r1716 version.
zzhang at login1.surveyor:~/cog/modules/vdsk/dist/vdsk-0.3-dev/examples/vdsk> swift -sites.file ../../etc/sites-BG.xml -tc.file ../../etc/tc-BG.data first.swift
Swift v0.3-dev swift-r1716 cog-r1934

RunID: 20080307-1813-i00h1fia
Progress:
echo started
error: Notification(int timeout): socket = new ServerSocket(recvPort); Address already in use
2008-03-07 18:13:48,122 WARN submitQueue.NonBlockingSubmit [pool-1-thread-1,notifyPreviousQueue:71] Warning: Task handler throws exception and also sets status

the sites-BG.xml is like below

/home/zzhang

along with the tc-BG.data file

# sitename transformation path INSTALLED platform profiles
bgp echo /bin/echo INSTALLED INTEL32::LINUX null
localhost cat /bin/cat INSTALLED INTEL32::LINUX null
localhost ls /bin/ls INSTALLED INTEL32::LINUX null
localhost grep /bin/grep INSTALLED INTEL32::LINUX null
localhost sort /bin/sort INSTALLED INTEL32::LINUX null
localhost paste /bin/paste INSTALLED INTEL32::LINUX null
bgp sleep /bin/sleep INSTALLED INTEL32::LINUX null

zhao

Ben Clifford wrote:
> On Fri, 7 Mar 2008, Zhao Zhang wrote:
>
>> Where do those -info files live, if I run SWIFT locally on a linux box?
>
> You will need to use a recent (as in r1715 swift, r1934 cog) build from
> SVN.
> Edit your swift.properties file so that this setting:
>
> wrapperlog.always.transfer=false
>
> is changed to this:
>
> wrapperlog.always.transfer=true
>
> Then when you run, you will get a log file called:
>
> whatever-20080101-9999-abcdef.log
>
> and a corresponding directory,
>
> whatever-20080101-9999-abcdef.d/
>
> In that directory, you should see one *-info file for each job that is run
> (with the * being the jobid, the same as passed on the wrapper.sh command
> line that we talked about 12h ago)

From benc at hawaga.org.uk Fri Mar 7 18:34:46 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 8 Mar 2008 00:34:46 +0000 (GMT)
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: <47D1DB48.4080709@uchicago.edu>
References: <47BEC111.8030503@mcs.anl.gov> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu>
Message-ID: 

On Fri, 7 Mar 2008, Zhao Zhang wrote:

> error: Notification(int timeout): socket = new ServerSocket(recvPort); Address
> already in use

That's an error I'm not familiar with. At a guess, I'd say something like
provider-deef is trying to open a server socket on a manually specified
port (recvPort) that you already have something listening on.

Ioan, does that seem likely?
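One way to test the guess about the "Address already in use" failure is to check whether anything is already listening on the notification port before starting the run. This is only a sketch: the thread never names the actual recvPort value, so port 50001 below is purely illustrative, and the `port_in_use` helper is a made-up name; the netstat output format also varies between systems.

```shell
# Report whether a given TCP port already has a listener, which would
# explain a "new ServerSocket(recvPort): Address already in use" error.
# 50001 is a made-up example port; substitute the real recvPort.
port_in_use() {
    netstat -an 2>/dev/null | grep -Eq "[.:]$1[[:space:]].*LISTEN"
}

if port_in_use 50001; then
    echo "port 50001 is already in use"
else
    echo "port 50001 looks free (or netstat is unavailable)"
fi
```

If the port is taken, either free it, configure a different recvPort, or (as the code apparently already does) let the listener fall back to another port.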
-- From benc at hawaga.org.uk Fri Mar 7 18:40:54 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 8 Mar 2008 00:40:54 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D1DB48.4080709@uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C2FB7B.3000901@uchicago.edu> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> Message-ID: On Fri, 7 Mar 2008, Zhao Zhang wrote: > error: Notification(int timeout): socket = new ServerSocket(recvPort); Address > already in use > 2008-03-07 18:13:48,122 WARN submitQueue.NonBlockingSubmit > [pool-1-thread-1,notifyPreviousQueue:71] Warning: Task handler throws > exception and also sets status At that place in the log file, do you also get a stack trace? Put the whole log file somewhere i can see it. 
--

From zhaozhang at uchicago.edu Fri Mar 7 19:06:40 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Fri, 07 Mar 2008 19:06:40 -0600
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu>
Message-ID: <47D1E6A0.3080805@uchicago.edu>

Hi, Ben

There is nothing in the sleep-20080307-1808-ik0izkmg.d directory, and also
nothing in the rlog file. When I ran it through Falkon before, it still had
that socket error, but that was OK; Swift used another port.

zhao

Ben Clifford wrote:
> On Fri, 7 Mar 2008, Zhao Zhang wrote:
>
>> error: Notification(int timeout): socket = new ServerSocket(recvPort); Address
>> already in use
>> 2008-03-07 18:13:48,122 WARN submitQueue.NonBlockingSubmit
>> [pool-1-thread-1,notifyPreviousQueue:71] Warning: Task handler throws
>> exception and also sets status
>
> At that place in the log file, do you also get a stack trace? Put the
> whole log file somewhere i can see it.
> >

From iraicu at cs.uchicago.edu Fri Mar 7 19:49:09 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Fri, 07 Mar 2008 19:49:09 -0600
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu>
Message-ID: <47D1F095.6070207@cs.uchicago.edu>

That should be labeled as a warning... if that port is in use, it will try
another one, so that is not the problem.

Ioan

Ben Clifford wrote:
> On Fri, 7 Mar 2008, Zhao Zhang wrote:
>
>> error: Notification(int timeout): socket = new ServerSocket(recvPort); Address
>> already in use
>
> That's an error I'm not familiar with. At a guess, I'd say something like
> provider-deef is trying to open a server socket on a manually specified
> port (recvPort) that you already have something listening on.
>
> Ioan, does that seem likely?

From benc at hawaga.org.uk Sat Mar 8 00:49:40 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Sat, 8 Mar 2008 06:49:40 +0000 (GMT)
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: <47D1E6A0.3080805@uchicago.edu>
References: <47BEC111.8030503@mcs.anl.gov> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D1E6A0.3080805@uchicago.edu>
Message-ID: 

On Fri, 7 Mar 2008, Zhao Zhang wrote:

> Hi, Ben
>
> There is nothing in the sleep-20080307-1808-ik0izkmg.d directory, and also
> nothing in rlog file. When I could run it through falkon before, it still has
> that socket error, it is ok, swift will use another port.

You should get something in the .log file, though. Please send that.
-- From zhaozhang at uchicago.edu Sat Mar 8 17:15:10 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sat, 08 Mar 2008 17:15:10 -0600 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D1E6A0.3080805@uchicago.edu> Message-ID: <47D31DFE.5060808@uchicago.edu> Hi, Ben I didn't find any .log files in the directory where I run the swift command and the working directory where the task should be executed. zhao Ben Clifford wrote: > On Fri, 7 Mar 2008, Zhao Zhang wrote: > > >> Hi, Ben >> >> There is nothing in the sleep-20080307-1808-ik0izkmg.d directory, and also >> nothing in rlog file. When I could run it through falkon before, it still has >> that socket error, it is ok, swift will use another port. >> > > You should get something in the .log file, though. Please send that. 
> >

From zhaozhang at uchicago.edu Sat Mar 8 17:50:19 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Sat, 08 Mar 2008 17:50:19 -0600
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D1E6A0.3080805@uchicago.edu>
Message-ID: <47D3263B.4080705@uchicago.edu>

Hi, All

It is OK now; I solved that problem. It was a simple typo in the
GPfactoryservice path. Now Swift can send tasks directly to Falkon. Then I
submitted 100 sleep 0 tasks from Swift to Falkon; it took 50 seconds, from
Falkon's point of view, to complete these tasks. The info.tar file contains
the info files for these 100 sleep jobs, and stdout.txt is what I got from
Swift's standard output.

Thanks again for your help.

zhao

Ben Clifford wrote:
> On Fri, 7 Mar 2008, Zhao Zhang wrote:
>
>> Hi, Ben
>>
>> There is nothing in the sleep-20080307-1808-ik0izkmg.d directory, and also
>> nothing in rlog file. When I could run it through falkon before, it still has
>> that socket error, it is ok, swift will use another port.
>
> You should get something in the .log file, though. Please send that.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: info.tar
Type: application/octet-stream
Size: 163840 bytes
Desc: not available
URL: 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stdout.txt URL: From zhaozhang at uchicago.edu Sun Mar 9 00:28:27 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sun, 09 Mar 2008 00:28:27 -0600 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47C30053.8010608@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> Message-ID: <47D3838B.3040506@uchicago.edu> ok, here is the info files for the test of 500 sleep 0 on 1 P-SET. zhao Ben Clifford wrote: > On Fri, 7 Mar 2008, Zhao Zhang wrote: > > >> error: Notification(int timeout): socket = new ServerSocket(recvPort); Address >> already in use >> 2008-03-07 18:13:48,122 WARN submitQueue.NonBlockingSubmit >> [pool-1-thread-1,notifyPreviousQueue:71] Warning: Task handler throws >> exception and also sets status >> > > At that place in the log file, do you also get a stack trace? Put the > whole log file somewhere i can see it. > > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: info.tar
Type: application/octet-stream
Size: 778240 bytes
Desc: not available
URL: 

From wilde at mcs.anl.gov Sun Mar 9 16:09:59 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 09 Mar 2008 16:09:59 -0500
Subject: [Swift-devel] Re: [Swft] Re: Question of wrapper.sh
In-Reply-To: <47D3838B.3040506@uchicago.edu>
References: <47BEC111.8030503@mcs.anl.gov> <47C3E648.8040101@uchicago.edu> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu>
Message-ID: <47D45227.8020201@mcs.anl.gov>

(taking swft off cc list)

Zhao, I've lost track of what these logs are for. This run is using all 64
compute nodes on 1 pset, right?

I compute these stats for these logs:

nfiles=500 Avg=1.27 secs, Min=0 secs, Max=4 secs, Run Duration=138 secs

So not too much variation in run time, but pretty slow.

- mike

On 3/9/08 12:28 AM, Zhao Zhang wrote:
> ok, here is the info files for the test of 500 sleep 0 on 1 P-SET.
>
> zhao
>
> Ben Clifford wrote:
>> On Fri, 7 Mar 2008, Zhao Zhang wrote:
>>
>>> error: Notification(int timeout): socket = new
>>> ServerSocket(recvPort); Address
>>> already in use
>>> 2008-03-07 18:13:48,122 WARN submitQueue.NonBlockingSubmit
>>> [pool-1-thread-1,notifyPreviousQueue:71] Warning: Task handler throws
>>> exception and also sets status
>>
>> At that place in the log file, do you also get a stack trace? Put the
>> whole log file somewhere i can see it.
From wilde at mcs.anl.gov Sun Mar 9 16:32:16 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Sun, 09 Mar 2008 16:32:16 -0500
Subject: [Swift-devel] Can scp data provider be used with Swift?
Message-ID: <47D45760.4060509@mcs.anl.gov>

We'd like to do a Swift run on the SiCortex machine using Falkon as the
execution provider. At the moment there's no Java on the SiCortex and no
ready access to its filesystem from a Linux host with Java.

Is it feasible to run Swift on a Linux host with Falkon for the job
provider and scp for the data provider? If so, how would this be specified?

From benc at hawaga.org.uk Sun Mar 9 19:29:05 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 10 Mar 2008 00:29:05 +0000 (GMT)
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: <47D3838B.3040506@uchicago.edu>
References: <47BEC111.8030503@mcs.anl.gov> <47C421FC.10405@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu>
Message-ID: 

OK, I will look at those. At the same time that you send -info logs, please
can you send the main log file (that is named something like
foo-20080101-000-abcdef.log) as that has interesting information too.

On Sun, 9 Mar 2008, Zhao Zhang wrote:

> ok, here is the info files for the test of 500 sleep 0 on 1 P-SET.
> > zhao
> >
> > Ben Clifford wrote:
> > > On Fri, 7 Mar 2008, Zhao Zhang wrote:
> > >
> > > > error: Notification(int timeout): socket = new ServerSocket(recvPort);
> > > > Address already in use
> > > > 2008-03-07 18:13:48,122 WARN submitQueue.NonBlockingSubmit
> > > > [pool-1-thread-1,notifyPreviousQueue:71] Warning: Task handler throws
> > > > exception and also sets status
> > >
> > > At that place in the log file, do you also get a stack trace? Put the whole
> > > log file somewhere i can see it.

From benc at hawaga.org.uk Sun Mar 9 19:47:09 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 10 Mar 2008 00:47:09 +0000 (GMT)
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C429D0.9010109@cs.uchicago.edu> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu>
Message-ID: 

Whichever version of date you used, it doesn't support more than 1s
accuracy in its output. That's irksome because it's that subsecond accuracy
that I wanted from these log files.

date on OS X doesn't have that precision; fairly recent (in the past
couple of years at least) GNU coreutils date does. It would be nice if you
could find out whether that is installed and use that instead...
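Ben's point about date can be handled with a probe: GNU coreutils date understands the %N (nanoseconds) conversion, while the BSD date shipped with OS X prints the "N" literally. A sketch of such a probe follows; the `timestamp` function name is just illustrative, not anything from the wrapper script.

```shell
# Emit a subsecond timestamp when the local date supports %N (GNU
# coreutils), and fall back to whole seconds otherwise (e.g. BSD/OS X
# date, which echoes a literal "N" instead of nanoseconds).
timestamp() {
    t=$(date +%s.%N)
    case "$t" in
        *[!0-9.]*) date +%s ;;  # %N unsupported: output contains a literal N
        *)         echo "$t" ;;
    esac
}

timestamp
```

On a system with GNU date this prints something like 1205107200.123456789; on OS X it degrades to whole seconds instead of producing garbage in the -info files.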
--

From iraicu at cs.uchicago.edu Sun Mar 9 19:51:20 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Sun, 09 Mar 2008 19:51:20 -0500
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: 
References: <47BEC111.8030503@mcs.anl.gov> <47C43C40.50702@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu>
Message-ID: <47D48608.2060204@cs.uchicago.edu>

Ben,
Those logs are only for 1 CPU, so most things will take less than 1 sec.
In the case where we use 100s of CPUs (a basic scenario for the BG/P),
things will take 10s of seconds, so 1 sec resolution should be OK. Zhao,
did you re-run that test with 256 CPUs? Ben should be looking at those
logs, not the 1 CPU case.

Ioan

Ben Clifford wrote:
> Whichever version of date that you used, it doesn't support more than 1s
> accuracy in its output. That's irksome because its that subsecond accuracy
> that I wanted from these log files.
>
> date on os x doesn't have that precision, fairly recent (in the past
> couple of years at least) GNU coreutils date does. It would be nice if you
> could find if that is installed and use that instead...

From benc at hawaga.org.uk Sun Mar 9 19:54:05 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 10 Mar 2008 00:54:05 +0000 (GMT)
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: <47D48608.2060204@cs.uchicago.edu>
References: <47BEC111.8030503@mcs.anl.gov> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu>
Message-ID: 

On Sun, 9 Mar 2008, Ioan Raicu wrote:

> Those logs are only for 1 CPU, so most things will take less than 1 sec. In
> the case where we use 100s of CPUs (a basic scenario for the BG/P), things
> will take 10s of seconds, so 1 sec resolution should be OK. Zhao, did you
> re-run that test with 256 CPUs? Ben should be looking at those logs, not the
> 1 CPU case.

OK. That will be more interesting then. Zhao, please send the regular .log
file at the same time too.
-- From zhaozhang at uchicago.edu Sun Mar 9 20:32:01 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sun, 09 Mar 2008 20:32:01 -0500 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D48608.2060204@cs.uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C43D15.8080608@cs.uchicago.edu> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago! .edu> Message-ID: <47D48F91.6040802@uchicago.edu> well, the tar ball I sent in the last email is from 256 cores, are they from the same cpu? By the way, I tried to find the .log files, but there isn't any in the folder where I started the swift script. zhao Ioan Raicu wrote: > Ben, > Those logs are only for 1 CPU, so most things will take less than 1 > sec. In the case where we use 100s of CPUs (a basic scenario for the > BG/P), things will take 10s of seconds, so 1 sec resolution should be > OK. Zhao, did you re-run that test with 256 CPUs? Ben should be > looking at those logs, not the 1 CPU case. > > Ioan > > Ben Clifford wrote: >> Whichever version of date that you used, it doesn't support more than >> 1s accuracy in its output. That's irksome because its that subsecond >> accuracy that I wanted from these log files. >> >> date on os x doesn't have that precision, fairly recent (in the past >> couple of years at least) GNU coreutils date does. It would be nice >> if you could find if that is installed and use that instead... 
From benc at hawaga.org.uk Sun Mar 9 20:51:07 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 10 Mar 2008 01:51:07 +0000 (GMT)
Subject: [Swift-devel] Re: Question of wrapper.sh
In-Reply-To: <47D48F91.6040802@uchicago.edu>
References: <47BEC111.8030503@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu> <47D48F91.6040802@uchicago.edu>
Message-ID: 

On Sun, 9 Mar 2008, Zhao Zhang wrote:

> well, the tar ball I sent in the last email is from 256 cores, are they from
> the same cpu? By the way, I tried to find the .log files, but there isn't any
> in the folder where I started the swift script.

How are you building provider-deef? The old way messes up logging. Since
r1525 in December, a way to build that doesn't do this is to build swift
and provider-deef at the same time, by using this command in the vdsk
directory:

ant -Dwith-provider-deef redist

You'll get a warning like this:

[input] Warning! The specified target directory
(/Users/benc/work/cog/modules/swift/../..//modules/swift/dist/swift-0.3-dev)
does not seem to contain a Swift build.

which is not a problem - press return a few times to get the build to
continue.
-- From zhaozhang at uchicago.edu Sun Mar 9 20:55:53 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Sun, 09 Mar 2008 20:55:53 -0500 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu> <47D48F91.6040802@uchicago.edu> Message-ID: <47D49529.1020809@uchicago.edu> Hi, Ben Here is the script that we are using to build the provider #/bin/sh if [ -z "${FALKON_ROOT}" ]; then echo "ERROR: environment variable FALKON_ROOT not defined" 1>&2 return 1 fi if [ ! -d "${FALKON_ROOT}" ]; then echo "ERROR: invalid FALKON_ROOT set: $FALKON_ROOT" 1>&2 return 1 fi cd ${FALKON_ROOT}/cog/modules/provider-deef ant distclean ant -Ddist.dir=../vdsk/dist/vdsk-0.3-dev/ dist So I need to delete the last line and add ant -Dwith-provider-deef -Ddist.dir=../vdsk/dist/vdsk-0.3-dev/ redist right? zhao Ben Clifford wrote: > On Sun, 9 Mar 2008, Zhao Zhang wrote: > > >> well, the tar ball I sent in the last email is from 256 cores, are they from >> the same cpu? By the way, I tried to find the .log files, but there isn't any >> in the folder where I started the swift script. >> > > How are you building provider-deef? The old way messes up logging. Since > r1525 in December, a way to build that doesn't do this is to build swift > and provider-deef at the same time, by using this command in the vdsk > directory: > > ant -Dwith-provider-deef redist > > You'll get a warning like this: > [input] Warning! The specified target directory > (/Users/benc/work/cog/modules/swift/../..//modules/swift/dist/swift-0.3-dev) > does not seem to contain a Swift build. 
> > which is not a problem - press return a few times to get the build to > continue. > > From benc at hawaga.org.uk Sun Mar 9 20:58:51 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 10 Mar 2008 01:58:51 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D49529.1020809@uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu> <47D48F91.6040802@uchicago.edu> <47D49529.1020809@uchicago.edu> Message-ID: On Sun, 9 Mar 2008, Zhao Zhang wrote: > cd ${FALKON_ROOT}/cog/modules/provider-deef > ant distclean > ant -Ddist.dir=../vdsk/dist/vdsk-0.3-dev/ dist > So I need to delete the last line and add > > ant -Dwith-provider-deef -Ddist.dir=../vdsk/dist/vdsk-0.3-dev/ redist don't run any build command in the provider-deef directory. replace the above three lines with: cd ${FALKON_ROOT}/cog/modules/vdsk/ ant -Dwith-provider-deef redist -- From iraicu at cs.uchicago.edu Mon Mar 10 00:55:03 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 10 Mar 2008 00:55:03 -0500 Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D48F91.6040802@uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C440F9.6090701@mcs.anl.gov> <47C44238.9060007@cs.uchicago.edu> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu> <47D48F91.6040802@uchicago.edu> Message-ID: <47D4CD37.7020006@cs.uchicago.edu> I am a little bit behind here on the emails... 
based on the Falkon logs, it seems that the low throughput we are getting in the latest Swift runs is due to throttling. Where are all the various throttling parameters that we should change, to ensure that Swift submits to Falkon as fast as possible with all available jobs? I assume there is a jobs/sec throttle, a maximum number of outstanding jobs (i.e. falkon queued jobs + running jobs), and maybe others. Thanks, Ioan Zhao Zhang wrote: > well, the tar ball I sent in the last email is from 256 cores, are > they from the same cpu? By the way, I tried to find the .log files, > but there isn't any in the folder where I started the swift script. > > zhao > > Ioan Raicu wrote: >> Ben, >> Those logs are only for 1 CPU, so most things will take less than 1 >> sec. In the case where we use 100s of CPUs (a basic scenario for the >> BG/P), things will take 10s of seconds, so 1 sec resolution should be >> OK. Zhao, did you re-run that test with 256 CPUs? Ben should be >> looking at those logs, not the 1 CPU case. >> >> Ioan >> >> Ben Clifford wrote: >>> Whichever version of date that you used, it doesn't support more >>> than 1s accuracy in its output. That's irksome because its that >>> subsecond accuracy that I wanted from these log files. >>> >>> date on os x doesn't have that precision, fairly recent (in the past >>> couple of years at least) GNU coreutils date does. It would be nice >>> if you could find if that is installed and use that instead... >>> >> > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Mon Mar 10 01:23:47 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 10 Mar 2008 06:23:47 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: <47D4CD37.7020006@cs.uchicago.edu> References: <47BEC111.8030503@mcs.anl.gov> <47C44EC2.4070901@uchicago.edu> <47C479EE.90901@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu> <47D48F91.6040802@uchicago.edu> <47D4CD37.7020006@cs.uchicago.edu> Message-ID: The user guide has a section on properties that you can configure - in here, http://www.ci.uchicago.edu/swift/guides/userguide.php#engineconfiguration pretty much anything with the word 'throttle' in it. If you give me the .log files for runs, I can look at what the rate control stuff is doing. In the past day or so, I've reduced throttle.score.job.factor substantially, to be more appropriate for GRAM2 submission - I suspect for something like Falkon you should make it much higher. It used to be 4 (which means 402 jobs executing at once maximum), but is now 0.2 (20 jobs at once maximum). For Falkon on a large number of CPUs, you probably want to make that higher (maybe number of CPUs divided by about 30) On Mon, 10 Mar 2008, Ioan Raicu wrote: > I am a little bit behind here on the emails... based on the Falkon logs, it > seems that the low throughput we are getting in the latest Swift runs is due > to throttling. 
Where are all the various throttling parameters that we should > change, to ensure that Swift submits to Falkon as fast as possible with all > available jobs? I assume there is a jobs/sec throttle, a maximum number of > outstanding jobs (i.e. falkon queued jobs + running jobs), and maybe others. > Thanks, > Ioan > > Zhao Zhang wrote: > > well, the tar ball I sent in the last email is from 256 cores, are they from > > the same cpu? By the way, I tried to find the .log files, but there isn't > > any in the folder where I started the swift script. > > > > zhao > > > > Ioan Raicu wrote: > > > Ben, > > > Those logs are only for 1 CPU, so most things will take less than 1 sec. > > > In the case where we use 100s of CPUs (a basic scenario for the BG/P), > > > things will take 10s of seconds, so 1 sec resolution should be OK. Zhao, > > > did you re-run that test with 256 CPUs? Ben should be looking at those > > > logs, not the 1 CPU case. > > > > > > Ioan > > > > > > Ben Clifford wrote: > > > > Whichever version of date that you used, it doesn't support more than 1s > > > > accuracy in its output. That's irksome because its that subsecond > > > > accuracy that I wanted from these log files. > > > > > > > > date on os x doesn't have that precision, fairly recent (in the past > > > > couple of years at least) GNU coreutils date does. It would be nice if > > > > you could find if that is installed and use that instead... > > > > > > > > > > > From hategan at mcs.anl.gov Mon Mar 10 07:16:21 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 10 Mar 2008 07:16:21 -0500 Subject: [Swift-devel] Failed null In-Reply-To: References: Message-ID: <1205151381.11504.8.camel@blabla.mcs.anl.gov> Smells of NPE. 
On Thu, 2008-03-06 at 17:26 +0000, Ben Clifford wrote: > I'm seeing plenty of errors during stagein that look like this: > > 2008-03-06 10:42:02,911-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, > identity=urn:0-624-4-1204821602563) setting status to Submitting > 2008-03-06 10:42:02,911-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, > identity=urn:0-624-4-1204821602563) setting status to Submitted > 2008-03-06 10:42:02,912-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, > identity=urn:0-624-4-1204821602563) setting status to Active > 2008-03-06 10:42:02,937-0600 DEBUG TaskImpl Task(type=FILE_TRANSFER, > identity=urn:0-624-4-1204821602563) setting status to Failed null > > 'null' is not so helpful - also there's nothing indicating which file was > attempting to be transferred here... From hategan at mcs.anl.gov Mon Mar 10 07:17:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 10 Mar 2008 07:17:43 -0500 Subject: [Swift-devel] Can scp data provider be used with Swift? In-Reply-To: <47D45760.4060509@mcs.anl.gov> References: <47D45760.4060509@mcs.anl.gov> Message-ID: <1205151463.11504.10.camel@blabla.mcs.anl.gov> On Sun, 2008-03-09 at 16:32 -0500, Michael Wilde wrote: > We'd like to do a Swift run on the SiCortex machine using Falkon as the > execution provider. > > At the moment there's no Java on the SiCortex and no ready access to its > filesystem from a Linux host with Java. > > Is it feasible to run Swift on a Linux host with Falkon for the job > provider and scp for the data provider? It's possible. I2U2 does it. > If so, how would this be specified? I'm fuzzy about it at the moment and battery is low... 
> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Tue Mar 11 17:26:20 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 11 Mar 2008 22:26:20 +0000 (GMT) Subject: [Swift-devel] Re: Question of wrapper.sh In-Reply-To: References: <47BEC111.8030503@mcs.anl.gov> <47C5039F.8020001@cs.uchicago.edu> <47D07BDC.60406@uchicago.edu> <47D0DA98.6010308@cs.uchicago.edu> <47D14650.2000206@cs.uchicago.edu> <47D16F6D.9070903@uchicago.edu> <47D1DB48.4080709@uchicago.edu> <47D3838B.3040506@uchicago.edu> <47D48608.2060204@cs.uchicago.edu> <47D48F91.6040802@uchicago.edu> <47D4CD37.7020006@cs.uchicago.edu> Message-ID: anything happening with this now? -- From benc at hawaga.org.uk Wed Mar 12 22:32:02 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 13 Mar 2008 03:32:02 +0000 (GMT) Subject: [Swift-devel] swift 0.4-rc2 Message-ID: I have put a second release candidate online: http://www.ci.uchicago.edu/~benc/vdsk-0.4rc2.tar.gz This is from newer versions of the SVNs: swift r1718 and cog r1934. What's changed from rc1: wrapper log stageout; more conservative job throttles; high resolution timestamping in wrapper logs where available; host type selection for GRAM4. As before, if no major bugs appear, I'll release it in a couple of days. Please test. -- From benc at hawaga.org.uk Wed Mar 12 23:40:52 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 13 Mar 2008 04:40:52 +0000 (GMT) Subject: [Swift-devel] more ftp errors running terminable->tg uc Message-ID: In addition to the mlst error reported in the thread 'stageout: Expected multiline reply', today I see a similar-but-different error with other swift configurations running on terminable going to TG UC. The below occurs when I try PBS+GRAM4, fork+gram4, fork+gram2 (with the mlst error occurring with pbs+gram2 as before). 
I do not get this behaviour submitting from terminable to TeraPort. I do get it submitting to the OSG site UCLA_Saxon_Tier3. This is all running the test 130-fmri, which has in the past managed to trigger a bunch of race conditions that the other tests haven't. In general the other test I've been running, 061-cattwo, has been working mostly ok. The specific configurations that I use are in the svn in vdsks/tests/sites/ Server refused performing the request. Custom message: (error code 1) [Nested exception message: Custom message: Unexpected reply: 451 ocurred during retrieve() org.globus.ftp.exception.DataChannelException: setPassive() must match store() and setActive() - retrieve() (error code 2) org.globus.ftp.exception.DataChannelException: setPassive() must match store() and setActive() - retrieve() (error code 2) at org.globus.ftp.extended.GridFTPServerFacade.retrieve(GridFTPServerFacade.java:469) at org.globus.ftp.FTPClient.put(FTPClient.java:1289) at org.globus.cog.abstraction.impl.file.gridftp.old.FileResourceImpl.putFile(FileResourceImpl.java:399) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.doDestination(DelegatedFileTransferHandler.java:356) at org.globus.cog.abstraction.impl.fileTransfer.CachingDelegatedFileTransferHandler.doDestination(CachingDelegatedFileTransferHandler.java:47) at org.globus.cog.abstraction.impl.fileTransfer.DelegatedFileTransferHandler.run(DelegatedFileTransferHandler.java:493) at java.lang.Thread.run(Thread.java:595) ] -- From benc at hawaga.org.uk Thu Mar 13 13:16:08 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 13 Mar 2008 18:16:08 +0000 (GMT) Subject: [Swift-devel] Re: more ftp errors running terminable->tg uc In-Reply-To: References: Message-ID: I also get the previously reported errors when running on tg-login1.uc.teragrid.org - the same sites that don't work from terminable don't work here, and teraport, which does work from terminable does work here. 
-- From skenny at uchicago.edu Fri Mar 14 16:42:30 2008 From: skenny at uchicago.edu (skenny at uchicago.edu) Date: Fri, 14 Mar 2008 16:42:30 -0500 (CDT) Subject: [Swift-devel] Re: misc swift errors Message-ID: <20080314164230.BBU09580@m4500-02.uchicago.edu> so, ben is suggesting that the use of relative paths within the swift script may be the problem here. can you rerun giving the mapper the full path? string inputName=@strcat("/disks/gpfs/fmri/cnari/swift/lhROI7_4p2filter_input/sub.",subject,".block",file,".txt"); ---- Original message ---- >Date: Fri, 14 Mar 2008 14:53:10 -0500 >From: "Uri Hasson" >Subject: Re: misc swift errors >To: skenny at uchicago.edu > >Hi Sarah, > >I'm using the following: >A swift properties file: >/disks/gpfs/fmri/cnari/swift/projects/uhasson/peakfit_project/all_runs/run_0314_lhREG7_f4p2/swift.properties > >and config files in >/disks/gpfs/fmri/cnari/swift/projects/uhasson/peakfit_project/swift.conf > > >On Fri, Mar 14, 2008 at 2:51 PM, wrote: >> which config files (tc.data and sites.xml) are you using for >> this run? >> >> >> >> ---- Original message ---- >> >Date: Fri, 14 Mar 2008 14:30:35 -0500 >> >From: "Uri Hasson" >> >Subject: misc swift errors >> >To: "Sarah Kenny" , "Mihael Hategan" >> >> > >> >Hey SWIFT gurus.. I'm running swift heavy duty and >> encountering some >> >errors I can't track. >> > >> >1) In a log file of a run that's still ongoing there are >> errors on >> >"status files not" found: >> >/disks/gpfs/fmri/cnari/swift/projects/uhasson/peakfit_project/all_runs/run_0314_lhREG8_f4p2/ccf-perm-wf-20080314-1308-g61h5kic.log >> >But the job seems to be continuing... >> > >> >2) another run simply crashed with errors. Log at: >> >/disks/gpfs/fmri/cnari/swift/projects/uhasson/peakfit_project/all_runs/run_0314_lhREG7_f4p2/ccf-perm-wf-20080314-1006-ix8mnzfb.log >> >It says it can't link to a file that exists.. >> > >> >Any ideas -- much appreciated. 
>> > >> >Uri >> From wilde at mcs.anl.gov Fri Mar 14 17:47:04 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 14 Mar 2008 17:47:04 -0500 Subject: [Swift-devel] Re: more ftp errors running terminable->tg uc In-Reply-To: References: Message-ID: <47DB0068.6020305@mcs.anl.gov> Do you have (or need to get) the gridftp group involved in this? Or is this a cog-level error that only Mihael supports at the moment? Is the problem reproducible with globus-url-copy? On 3/13/08 1:16 PM, Ben Clifford wrote: > I also get the previously reported errors when running on > tg-login1.uc.teragrid.org - the same sites that don't work from terminable > don't work here, and teraport, which does work from terminable does work > here. From benc at hawaga.org.uk Fri Mar 14 17:51:50 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 14 Mar 2008 22:51:50 +0000 (GMT) Subject: [Swift-devel] Re: more ftp errors running terminable->tg uc In-Reply-To: <47DB0068.6020305@mcs.anl.gov> References: <47DB0068.6020305@mcs.anl.gov> Message-ID: On Fri, 14 Mar 2008, Michael Wilde wrote: > Do you have (or need to get) the gridftp group involved in this? > > Or is this a cog-level error that only Mihael supports at the moment? > > Is the problem reproducible with globus-url-copy? More report coming soon. Please wait. -- From benc at hawaga.org.uk Fri Mar 14 18:14:23 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 14 Mar 2008 23:14:23 +0000 (GMT) Subject: [Swift-devel] Re: more ftp errors running terminable->tg uc In-Reply-To: References: Message-ID: I have dug some more into this. The cog gridftp provider enables data channel reuse when talking to gridftp servers that report exactly version 2.3. Some of the sites that I am testing against report that version. Some report version 2.5. The sites which are version 2.3 fail to run test workflow '130-fmri' in the tests/language-behaviour directory. The sites which are not 2.3 do not exhibit this problem. 
This happens submitting from both tg-login1.uc.teragrid.org and from terminable.ci.uchicago.edu On terminable: If I change the cog gridftp provider to enable gridftp data channel reuse for version 2.5 too, then the 2.5 sites also break. If I disable data channel reuse entirely (which appears to need a source code change) then all site tests work ok. There are two separate issues here: This needs fixing in general, presumably in cog. At the moment, I'm not particularly inclined to spend large amounts of time learning how the cog ftp provider works when potentially Mihael could look at it. However, it's unclear how much time Mihael has to work on this, given his other projects, and I have no particular belief that it will be fixed any time soon. In a Swift-specific context, I'm happy for data-channel reuse to be turned off for now (eg until someone figures out what is up at the cog level) - it's already not used for any recent gridftp server (i.e. v2.5) such as tg-gridftp.uc.teragrid.org. No one has reported this as a problem in the wild (yet). I suspect test 130-fmri is especially good at exhibiting this problem. I think therefore that this should not be a release-stopper for 0.4; but that should anyone actually come across it in the wild we should rapidly put out a 0.4.1 or a 0.5 with data channel caching disabled. I would appreciate commentary on: i) the above release proposal ii) the likelihood that Mihael will have time to look at this and when that would happen (which is essentially the question - do I have to go learn the guts of the gt2 cog provider?) 
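The gating behaviour Ben describes can be summarised in a few lines. This is a hedged sketch of the logic only: `reuse_enabled` and `disable_reuse` are invented names for illustration, not the CoG provider's actual API.

```shell
# Sketch of the data-channel-reuse gating Ben describes: the cog gridftp
# provider enables reuse only when the server reports exactly version 2.3,
# and Ben's tested workaround is to disable reuse entirely.
disable_reuse=false   # set to true for the "disable entirely" workaround
reuse_enabled() {
    # $1: version string reported by the gridftp server
    [ "$disable_reuse" = false ] && [ "$1" = "2.3" ]
}
reuse_enabled "2.3" && echo "2.3: reuse on (the failing case)"
reuse_enabled "2.5" || echo "2.5: reuse off"
```

Under this reading, enabling reuse for 2.5 servers as well (Ben's experiment) widens the failure, while forcing `disable_reuse` everywhere makes all the site tests pass.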
-- From wilde at mcs.anl.gov Fri Mar 14 18:23:06 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 14 Mar 2008 18:23:06 -0500 Subject: [Swift-devel] Re: more ftp errors running terminable->tg uc In-Reply-To: References: Message-ID: <47DB08DA.7040703@mcs.anl.gov> i) proposal sounds good to me ii) Mihael is on pseudo-vacation (supposed to be real vacation at the moment, but he is being a great guy to help launch an i2u2 release that slipped). So lets wait for Mihael to weigh in. Only thing I can offer is once i2u2 release is live and stable, fix gridftp next, modulo vacation preferences). - Mike On 3/14/08 6:14 PM, Ben Clifford wrote: > I have dug some more into this. > > The cog gridftp provider enables data channel reuse when talking to > gridftp servers that report exactly version 2.3. > > Some of the sites that I am testing against report that version. Some > report version 2.5. > > The sites which are version 2.3 fail to run test workflow '130-fmri' in > the tests/language-behaviour directory. The sites which are not 2.3 do not > exhibit this problem. > > This happens submitting from both tg-login1.uc.teragrid.org and from > terminable.ci.uchicago.edu > > On terminable: > > If I change the cog gridftp provider to enable gridftp data channel reuse > for version 2.5 too, then the 2.5 sites also break. > > If I disable data channel reuse entirely (which appears to need a source > code change) then all site tests work ok. > > There are two separate issues here: > > This needs fixing in general, presumably in cog. At the moment, I'm not > particularly inclined to spend large amounts of time learning how the cog > ftp provider works when potentially mihael could look at it. However, its > unclear how much time mihael has to work on this, given his other projects > and I have no particular belief that it will be fixed any time soon. 
> > In a Swift-specific context, I'm happy for data-channel reuse to be turned > off for now (eg until someone figures out what is up at the cog level) - > its already not used for any recent gridftp server (i.e. v2.5) such as > tg-gridftp.uc.teragrid.org. > > No one has reported this as a problem in the wild (yet). I suspect test > 130-fmri is especially good at exhibiting this problem. > > I think therefore that this should not be a release-stopped for 0.4; but > that should anyone actually come across it in the wild we should rapidly > put out a 0.4.1 or a 0.5 with data channel caching disabled. > > I would appreciate commentary on: > > i) the above release proposal > > ii) the likelihood that Mihael will have time to look at this and when > that would happen (which is essentially the question - do I have to > go learn the guts of the gt2 cog provider?) > From benc at hawaga.org.uk Fri Mar 14 19:42:43 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Sat, 15 Mar 2008 00:42:43 +0000 (GMT) Subject: [Swift-devel] Re: swift 0.4-rc2 In-Reply-To: References: Message-ID: On Thu, 13 Mar 2008, Ben Clifford wrote: > http://www.ci.uchicago.edu/~benc/vdsk-0.4rc2.tar.gz > Please test. please? -- From hategan at mcs.anl.gov Sat Mar 15 04:36:02 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 15 Mar 2008 04:36:02 -0500 Subject: [Swift-devel] Re: more ftp errors running terminable->tg uc In-Reply-To: References: Message-ID: <1205573762.24604.2.camel@blabla.mcs.anl.gov> > I would appreciate commentary on: > > i) the above release proposal > > ii) the likelihood that Mihael will have time to look at this and when > that would happen (which is essentially the question - do I have to > go learn the guts of the gt2 cog provider?) Looking at the plan we have, I've only spent 1 week real time on making transfers faster. Which leaves another week of real time to fix the problems. 
> From wilde at mcs.anl.gov Sat Mar 15 07:34:40 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 15 Mar 2008 07:34:40 -0500 Subject: [Swift-devel] Re: swift 0.4-rc2 In-Reply-To: References: Message-ID: <47DBC260.1010104@mcs.anl.gov> I'm testing this weekend, but at the moment using 1723. I can switch to rc2 when I get a chance. On 3/14/08 7:42 PM, Ben Clifford wrote: > On Thu, 13 Mar 2008, Ben Clifford wrote: > >> http://www.ci.uchicago.edu/~benc/vdsk-0.4rc2.tar.gz >> Please test. > > please? > From wilde at mcs.anl.gov Sun Mar 16 19:48:23 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 16 Mar 2008 19:48:23 -0500 Subject: [Swift-devel] swift-falkon problem Message-ID: <47DDBFD7.2050700@mcs.anl.gov> Ioan, I'm stuck at: RunID: 20080316-1643-g4n8t252 Progress: runam3 started error: Notification(int timeout): socket = new ServerSocket(recvPort); Address already in use error: Notification(int timeout): socket = new ServerSocket(recvPort); Address already in use Waiting for notification for 0 ms Received notification with 1 messages Failed to transfer wrapper log from amps1-20080316-1643-g4n8t252/info/0/sico runam3 failed Execution failed: Exception in runam3: Arguments: [0000, 0.1899, 0.1858] Host: sico Does this look familiar? -- What I'm confused about is: - the deef-provider code that I get with a swift checkout seems to have out of date falkon stubs (I get a runtime error on a missing xml element) - if I grab a FalkonStubs jar from Zhao's bgp swift tree and use it in a newly compiled swift tree, should that work? It seems to get further. It *seems* like swift is reaching Falkon (I can see something in a falkon logfile that looks like swift-generated job ids) but then I'm getting the errors above. The log file doesn't contain any details, just what's below. I'll double-check all my steps and package up the full log file, but wanted to get this out to you before I spend too much more time debugging, hoping someone recognizes the problem. 
I note that I haven't yet found the strings above, like "Waiting for notification" in the swift source tree. Thanks, Mike 2008-03-16 16:43:42,807-0600 INFO vdl:createdirset END jobid=runam3-0cu5avpi - Done initializing directory structure 2008-03-16 16:43:42,809-0600 INFO vdl:dostagein START jobid=runam3-0cu5avpi - Staging in files 2008-03-16 16:43:42,810-0600 INFO vdl:dostagein END jobid=runam3-0cu5avpi - Staging in finished 2008-03-16 16:43:42,812-0600 DEBUG vdl:execute2 JOB_START jobid=runam3-0cu5avpi tr=runam3 arguments=[0000, 0.1899, 0.1858] tmpdir=amps1-20080316-1643-g4n8t252/jobs/0/runam3-0cu5avpi host=sico 2008-03-16 16:43:42,829-0600 DEBUG WeightedHostScoreScheduler multiplyScore(sico:0.000(1.000):1/1000002, -0.2) 2008-03-16 16:43:42,829-0600 DEBUG WeightedHostScoreScheduler Old score: 0.000, new score: -0.200 2008-03-16 16:43:43,693-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) setting status to Submitting 2008-03-16 16:43:43,693-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) setting status to Submitted 2008-03-16 16:43:43,693-0600 DEBUG WeightedHostScoreScheduler Submission time for Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808): 0ms. 
Score delta: 0.002564102564102564 2008-03-16 16:43:43,694-0600 DEBUG WeightedHostScoreScheduler multiplyScore(sico:-0.200(0.889):1/889402, 0.002564102564102564) 2008-03-16 16:43:43,694-0600 DEBUG WeightedHostScoreScheduler Old score: -0.200, new score: -0.197 2008-03-16 16:43:43,694-0600 INFO JobSubmissionTaskHandler Job submitted 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) setting status to Active 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) setting status to Failed 2008-03-16 16:43:44,218-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=runam3-0cu5avpi - Application exception: Task failed task:execute @ vdl-int.k, line: 386 sys:sequential @ vdl-int.k, line: 378 sys:try @ vdl-int.k, line: 377 task:allocatehost @ vdl-int.k, line: 356 vdl:execute2 @ execute-default.k, line: 23 sys:restartonerror @ execute-default.k, line: 21 sys:sequential @ execute-default.k, line: 19 sys:try @ execute-default.k, line: 18 sys:if @ execute-default.k, line: 17 sys:then @ execute-default.k, line: 16 sys:if @ execute-default.k, line: 15 vdl:execute @ amps1.kml, line: 52 runam3 @ amps1.kml, line: 92 sys:sequential @ amps1.kml, line: 91 sys:parallelfor @ amps1.kml, line: 73 sys:sequential @ amps1.kml, line: 72 doall @ amps1.kml, line: 142 sys:sequential @ amps1.kml, line: 141 sys:parallel @ amps1.kml, line: 131 vdl:mainp @ amps1.kml, line: 130 mainp @ vdl.k, line: 150 vdl:mains @ amps1.kml, line: 128 vdl:mains @ amps1.kml, line: 128 rlog:restartlog @ amps1.kml, line: 126 kernel:project @ amps1.kml, line: 2 amps1-20080316-1643-g4n8t252 From benc at hawaga.org.uk Mon Mar 17 06:05:15 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 17 Mar 2008 11:05:15 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DDBFD7.2050700@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> Message-ID: So from the Swift log you paste, this line: 
2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) setting status to Active 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) setting status to Failed suggests that provider-deef is reporting a failure up to the Swift runtime. There may be some more logs that you can turn on at the provider-deef or falkon layer, but I don't know what the likely ones would be. From a process point-of-view, I'm concerned about this: > - the deef-provider code that I get with a swift checkout seems to have > out > of date falkon stubs (I get a runtime error on a missing xml element) > - if I grab a FalkonStubs jar from Zhao's bgp swift tree and use it in a > newly compiled swift tree, should that work? It seems to get further. If provider-deef needs updating, those updates should be made in SVN; if it doesn't need updating, then you have some other problem. Ioan and Zhao, if you had to update something to make it work, please commit that change. -- From iraicu at cs.uchicago.edu Mon Mar 17 07:43:22 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 17 Mar 2008 07:43:22 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> Message-ID: <47DE676A.8090604@cs.uchicago.edu> Ben Clifford wrote: > So from the Swift log you paste, this line: > > 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808) setting status to Active > 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808) setting status to Failed > > suggests that provider-deef is reporting a failure up to the Swift > runtime. There may be some more logs that you can turn on at the > provider-deef or falkon layer, but I don't know what the likely ones would > be. 
> iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat log4j.properties.module log4j.logger.org.apache.axis.utils.JavaUtils=ERROR log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG I have this log4j property, do you have this enabled? This should enable more debug output from the Falkon provider. > From a process point-of-view, I'm concerned about this: > > >> - the deef-provider code that I get with a swift checkout seems to have >> out >> of date falkon stubs (I get a runtime error on a missing xml element) >> - if I grab a FalkonStubs jar from Zhao's bgp swift tree and use it in a >> newly compiled swift tree, should that work? It seems to get further. >> > > If provider-deef needs updating, those updates should be made in SVN; if > it doesn't need updating, then you have some other problem. Ioan and Zhao, > if you had to update something to make it work, please commit that change. > I know we should update SVN, we just haven't gotten around to it. I just updated the stubs in the Swift SVN (R1727). Mike, give it a try again from SVN. Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From iraicu at cs.uchicago.edu Mon Mar 17 08:03:52 2008
From: iraicu at cs.uchicago.edu (Ioan Raicu)
Date: Mon, 17 Mar 2008 08:03:52 -0500
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: <47DDBFD7.2050700@mcs.anl.gov>
References: <47DDBFD7.2050700@mcs.anl.gov>
Message-ID: <47DE6C38.40802@cs.uchicago.edu>

Michael Wilde wrote:
> Ioan,
>
> I'm stuck at:
>
> RunID: 20080316-1643-g4n8t252
> Progress:
> runam3 started
> error: Notification(int timeout): socket = new ServerSocket(recvPort);
> Address already in use
> error: Notification(int timeout): socket = new ServerSocket(recvPort);
> Address already in use

This is just a warning; it's not causing any trouble.

> Waiting for notification for 0 ms
> Received notification with 1 messages

This means that the Falkon service sent back a notification, which means that all went well: it had received a task, attempted to execute it, and returned back a result... but apparently a failed result.

> Failed to transfer wrapper log from
> amps1-20080316-1643-g4n8t252/info/0/sico

I don't understand this error; how is this error text being generated? Falkon only returns back a numeric exit code. Could this be a post-processing error, when Swift couldn't manipulate the local file system or couldn't find some expected files? What exit code does Falkon return for this task, 0, or something else?

> runam3 failed
> Execution failed:
> Exception in runam3:
> Arguments: [0000, 0.1899, 0.1858]
> Host: sico
>
> Does this look familiar?
>
> --
>
> What I'm confused about is:
>
> - the deef-provider code that I get with a swift checkout seems to
> have out of date falkon stubs (I get a runtime error on a missing xml
> element)

I just updated them in SVN.

>
> - if I grab a FalkonStubs jar from Zhao's bgp swift tree and use it in
> a newly compiled swift tree, should that work? It seems to get further.
> > It *seems* like swift is reaching Falkon - I can see something in a > falkon logfile that looks like swift-generated job ids) but then I'm > getting the errors above. We need to figure out if the failure is in executing the tasks in Falkon, or if that is OK, and the error is in Swift not finding some files afterwards. > > The log file doesnt contain any details, just whats below. > > I'll double-check all my steps and package up the full log file, but > wanted to get this out to you before I spend too much more time > debugging, hoping someone recognizes the problem. > > I note that I havent yet found the strings above, like "Waiting for > notification" in the swift source tree. That is from the FalkonStubs.jar (falkon/service/org/globus/GenericPortal/common/Notification.java), so you won't find that. I should probably disable all the logging from FalkonStubs.jar code by default. Once you enable the Falkon provider debug logging, there are more per task logs that get printed... for example, file cog/modules/provider-deef/src/org/globus/cog/abstraction/impl/execution/deef/NotificationThread.java would print "Falkon: waiting for notifications...", and then print the contents of the notification when it received them... 
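The "Address already in use" notices above come from Notification's `new ServerSocket(recvPort)` finding the port already bound. A minimal self-contained sketch of that condition (the class name is hypothetical; this is not the Falkon code itself, just the same failure mode):

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class BindDemo {
    // Returns the message seen when a second listener tries to bind a
    // port that an existing listener already holds.
    static String tryDoubleBind() {
        try (ServerSocket first =
                new ServerSocket(0, 50, InetAddress.getByName("127.0.0.1"))) {
            // Same port, same interface: this bind fails the way the
            // Notification constructor's warning does.
            new ServerSocket(first.getLocalPort(), 50,
                    InetAddress.getByName("127.0.0.1")).close();
            return "second bind unexpectedly succeeded";
        } catch (BindException expected) {
            // Harmless if the code falls back to the listener that is
            // already running, which is why this is only a warning.
            return "Address already in use";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryDoubleBind());
    }
}
```

Since the already-bound socket keeps working, code that catches the exception and reuses the existing listener can safely log this and continue.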
Ioan > > Thanks, > > Mike > > > > > 2008-03-16 16:43:42,807-0600 INFO vdl:createdirset END > jobid=runam3-0cu5avpi - Done initializing directory structure > 2008-03-16 16:43:42,809-0600 INFO vdl:dostagein START > jobid=runam3-0cu5avpi - Staging in files > 2008-03-16 16:43:42,810-0600 INFO vdl:dostagein END > jobid=runam3-0cu5avpi - Staging in finished > 2008-03-16 16:43:42,812-0600 DEBUG vdl:execute2 JOB_START > jobid=runam3-0cu5avpi tr=runam3 arguments=[0000, 0.1899, 0.1858] > tmpdir=amps1-20080316-1643-g4n8t252/jobs/0/runam3-0cu5avpi host=sico > 2008-03-16 16:43:42,829-0600 DEBUG WeightedHostScoreScheduler > multiplyScore(sico:0.000(1.000):1/1000002, -0.2) > 2008-03-16 16:43:42,829-0600 DEBUG WeightedHostScoreScheduler Old > score: 0.000, new score: -0.200 > 2008-03-16 16:43:43,693-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808) setting status to Submitting > 2008-03-16 16:43:43,693-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808) setting status to Submitted > 2008-03-16 16:43:43,693-0600 DEBUG WeightedHostScoreScheduler > Submission time for Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808): 0ms. 
Score delta: 0.002564102564102564 > 2008-03-16 16:43:43,694-0600 DEBUG WeightedHostScoreScheduler > multiplyScore(sico:-0.200(0.889):1/889402, 0.002564102564102564) > 2008-03-16 16:43:43,694-0600 DEBUG WeightedHostScoreScheduler Old > score: -0.200, new score: -0.197 > 2008-03-16 16:43:43,694-0600 INFO JobSubmissionTaskHandler Job submitted > 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808) setting status to Active > 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, > identity=urn:0-1-1-1205707420808) setting status to Failed > 2008-03-16 16:43:44,218-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION > jobid=runam3-0cu5avpi - Application exception: Task failed > task:execute @ vdl-int.k, line: 386 > sys:sequential @ vdl-int.k, line: 378 > sys:try @ vdl-int.k, line: 377 > task:allocatehost @ vdl-int.k, line: 356 > vdl:execute2 @ execute-default.k, line: 23 > sys:restartonerror @ execute-default.k, line: 21 > sys:sequential @ execute-default.k, line: 19 > sys:try @ execute-default.k, line: 18 > sys:if @ execute-default.k, line: 17 > sys:then @ execute-default.k, line: 16 > sys:if @ execute-default.k, line: 15 > vdl:execute @ amps1.kml, line: 52 > runam3 @ amps1.kml, line: 92 > sys:sequential @ amps1.kml, line: 91 > sys:parallelfor @ amps1.kml, line: 73 > sys:sequential @ amps1.kml, line: 72 > doall @ amps1.kml, line: 142 > sys:sequential @ amps1.kml, line: 141 > sys:parallel @ amps1.kml, line: 131 > vdl:mainp @ amps1.kml, line: 130 > mainp @ vdl.k, line: 150 > vdl:mains @ amps1.kml, line: 128 > vdl:mains @ amps1.kml, line: 128 > rlog:restartlog @ amps1.kml, line: 126 > kernel:project @ amps1.kml, line: 2 > amps1-20080316-1643-g4n8t252 > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Mon Mar 17 08:34:28 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 17 Mar 2008 08:34:28 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DE676A.8090604@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> Message-ID: <47DE7364.20700@mcs.anl.gov> I did a clean checkout to get the latest rev directly on the bblogin machine (previously I copied the code). Strange: this time provider-deef didnt show up in the modules directory. I *thought* last time it did, unless I'm imagining things. Did something just change w.r.t provider-deef, or is my memory faulty? - Mike On 3/17/08 7:43 AM, Ioan Raicu wrote: > > > Ben Clifford wrote: >> So from the Swift log you paste, this line: >> >> 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, >> identity=urn:0-1-1-1205707420808) setting status to Active >> 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, >> identity=urn:0-1-1-1205707420808) setting status to Failed >> >> suggests that provider-deef is reporting a failure up to the Swift >> runtime. There may be some more logs that you can turn on at the >> provider-deef or falkon layer, but I don't know what the likely ones would >> be. >> > iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat > log4j.properties.module > log4j.logger.org.apache.axis.utils.JavaUtils=ERROR > log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG > > I have this log4j property, do you have this enabled? This should > enable more debug output from the Falkon provider. 
>> >From a process point-of-view, I'm concerned about this: >> >> >>> - the deef-provider code that I get with a swift checkout seems to have >>> out >>> of date falkon stubs (I get a runtime error on a missing xml element) >>> - if I grab a FalkonStubs jar from Zhao's bgp swift tree and use it in a >>> newly compiled swift tree, should that work? It seems to get further. >>> >> >> If provider-deef needs updating, those updates should be made in SVN; if >> it doesn't need updating, then you have some other problem. Ioan and Zhao, >> if you had to update something to make it work, please commit that change. >> > I know we should update SVN, we just haven't gotten around to it. I > just updated the stubs in the Swift SVN (R1727). Mike, give it a try > again from SVN. > > Ioan > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > > From wilde at mcs.anl.gov Mon Mar 17 10:14:37 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 17 Mar 2008 10:14:37 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DE7364.20700@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DE7364.20700@mcs.anl.gov> Message-ID: <47DE8ADD.3020603@mcs.anl.gov> as i backtrack looking at my svn trees, i see that indeed my memory was wrong and i do need to checkout deef explicitly. 
- mike On 3/17/08 8:34 AM, Michael Wilde wrote: > I did a clean checkout to get the latest rev directly on the bblogin > machine (previously I copied the code). > > Strange: this time provider-deef didnt show up in the modules directory. > I *thought* last time it did, unless I'm imagining things. > > Did something just change w.r.t provider-deef, or is my memory faulty? > > - Mike > > > > On 3/17/08 7:43 AM, Ioan Raicu wrote: >> >> >> Ben Clifford wrote: >>> So from the Swift log you paste, this line: >>> >>> 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl >>> Task(type=JOB_SUBMISSION, identity=urn:0-1-1-1205707420808) >>> setting status to Active 2008-03-16 >>> 16:43:44,213-0600 DEBUG TaskImpl Task(type=JOB_SUBMISSION, >>> identity=urn:0-1-1-1205707420808) setting status to Failed >>> suggests that provider-deef is reporting a failure up to the Swift >>> runtime. There may be some more logs that you can turn on at the >>> provider-deef or falkon layer, but I don't know what the likely ones >>> would be. >>> >> iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat >> log4j.properties.module >> log4j.logger.org.apache.axis.utils.JavaUtils=ERROR >> log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG >> >> I have this log4j property, do you have this enabled? This should >> enable more debug output from the Falkon provider. >>> >From a process point-of-view, I'm concerned about this: >>> >>> >>>> - the deef-provider code that I get with a swift checkout seems to >>>> have out of date falkon stubs (I get a runtime error on a >>>> missing xml element) - if I grab a FalkonStubs jar from >>>> Zhao's bgp swift tree and use it in a newly compiled swift >>>> tree, should that work? It seems to get further. >>> >>> If provider-deef needs updating, those updates should be made in SVN; >>> if it doesn't need updating, then you have some other problem. Ioan >>> and Zhao, if you had to update something to make it work, please >>> commit that change. 
>>> >> I know we should update SVN, we just haven't gotten around to it. I >> just updated the stubs in the Swift SVN (R1727). Mike, give it a try >> again from SVN. >> >> Ioan >> >> -- >> =================================================== >> Ioan Raicu >> Ph.D. Candidate >> =================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> =================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dev.globus.org/wiki/Incubator/Falkon >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >> =================================================== >> =================================================== >> >> > From wilde at mcs.anl.gov Mon Mar 17 10:23:37 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 17 Mar 2008 10:23:37 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DE8ADD.3020603@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DE7364.20700@mcs.anl.gov> <47DE8ADD.3020603@mcs.anl.gov> Message-ID: <47DE8CF9.8060907@mcs.anl.gov> With a clean checkout of 1727 and explicit checkout of provider-deef, falkon works on the sicortex. - mike On 3/17/08 10:14 AM, Michael Wilde wrote: > as i backtrack looking at my svn trees, i see that indeed my memory was > wrong and i do need to checkout deef explicitly. > > - mike > > > On 3/17/08 8:34 AM, Michael Wilde wrote: >> I did a clean checkout to get the latest rev directly on the bblogin >> machine (previously I copied the code). >> >> Strange: this time provider-deef didnt show up in the modules >> directory. I *thought* last time it did, unless I'm imagining things. >> >> Did something just change w.r.t provider-deef, or is my memory faulty? 
>> >> - Mike >> >> >> >> On 3/17/08 7:43 AM, Ioan Raicu wrote: >>> >>> >>> Ben Clifford wrote: >>>> So from the Swift log you paste, this line: >>>> >>>> 2008-03-16 16:43:44,213-0600 DEBUG TaskImpl >>>> Task(type=JOB_SUBMISSION, >>>> identity=urn:0-1-1-1205707420808) setting status to >>>> Active 2008-03-16 16:43:44,213-0600 DEBUG >>>> TaskImpl Task(type=JOB_SUBMISSION, >>>> identity=urn:0-1-1-1205707420808) setting status to Failed suggests >>>> that provider-deef is reporting a failure up to the Swift runtime. >>>> There may be some more logs that you can turn on at the >>>> provider-deef or falkon layer, but I don't know what the likely ones >>>> would be. >>>> >>> iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat >>> log4j.properties.module >>> log4j.logger.org.apache.axis.utils.JavaUtils=ERROR >>> log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG >>> >>> I have this log4j property, do you have this enabled? This should >>> enable more debug output from the Falkon provider. >>>> >From a process point-of-view, I'm concerned about this: >>>> >>>> >>>>> - the deef-provider code that I get with a swift checkout seems to >>>>> have out of date falkon stubs (I get a runtime error on a >>>>> missing xml element) - if I grab a FalkonStubs jar from >>>>> Zhao's bgp swift tree and use it in a newly compiled swift >>>>> tree, should that work? It seems to get further. >>>> >>>> If provider-deef needs updating, those updates should be made in >>>> SVN; if it doesn't need updating, then you have some other problem. >>>> Ioan and Zhao, if you had to update something to make it work, >>>> please commit that change. >>>> >>> I know we should update SVN, we just haven't gotten around to it. I >>> just updated the stubs in the Swift SVN (R1727). Mike, give it a try >>> again from SVN. >>> >>> Ioan >>> >>> -- >>> =================================================== >>> Ioan Raicu >>> Ph.D. 
Candidate
>>> ===================================================
>>> Distributed Systems Laboratory
>>> Computer Science Department
>>> University of Chicago
>>> 1100 E. 58th Street, Ryerson Hall
>>> Chicago, IL 60637
>>> ===================================================
>>> Email: iraicu at cs.uchicago.edu
>>> Web: http://www.cs.uchicago.edu/~iraicu
>>> http://dev.globus.org/wiki/Incubator/Falkon
>>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
>>> ===================================================
>>> ===================================================
>>>
>>>
>>
>

From benc at hawaga.org.uk Mon Mar 17 14:16:34 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 17 Mar 2008 19:16:34 +0000 (GMT)
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: <47DE6C38.40802@cs.uchicago.edu>
References: <47DDBFD7.2050700@mcs.anl.gov> <47DE6C38.40802@cs.uchicago.edu>
Message-ID: 

On Mon, 17 Mar 2008, Ioan Raicu wrote:

> > Failed to transfer wrapper log from amps1-20080316-1643-g4n8t252/info/0/sico
> I don't understand this error, how is this error text being generated? Falkon
> only returns back a numeric exit code. Could this error be a post processing
> error when Swift couldn't manipulate the local file system, or it couldn't
> find some expected files? What exit code does Falkon return for this task,
> 0, or something else?

That bit is a follow-on error, so pretty much ignore it - I haven't figured out what the right thing to do is for presenting it to the user. Basically:

1. swift tries to run an executable (using provider-deef, in this case)
2. run of executable fails (i.e. provider-deef is passing back an error)
3. swift tries to stage back the wrapper log to help diagnosis
4. the wrapper log doesn't exist (presumably the wrapper never executed
   that far in the failed executable step 1)
5. swift reports the above error/warning that step 3 failed.

So that line is an error that is a follow-on from step 1 failing.
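Mechanically, steps 3-5 amount to a best-effort copy whose failure should be downgraded to a warning, since the interesting failure already happened at step 1. A sketch of that pattern (the class, method, and file names here are hypothetical, for illustration only; this is not Swift's actual stage-out code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WrapperLogFetch {
    // Best-effort stage-back of the wrapper log: a missing log is
    // reported as a warning and false is returned, rather than raising
    // a second error on top of the job failure itself.
    static boolean fetchWrapperLog(Path remoteInfoDir, Path localDir) {
        Path log = remoteInfoDir.resolve("wrapper.log");
        try {
            Files.copy(log, localDir.resolve("wrapper.log"));
            return true;
        } catch (IOException e) {
            // Mirrors the "Failed to transfer wrapper log from ..."
            // message the user sees.
            System.err.println("Failed to transfer wrapper log from " + log);
            return false;
        }
    }
}
```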
> We need to figure out if the failure is in executing the tasks in Falkon, or
> if that is OK, and the error is in Swift not finding some files afterwards.

Provider-deef is reporting that execution failed, so it's not the second of those. But there isn't enough log information in Mike's report to indicate where below the swift/provider-deef interface the error is occurring.

The present swift+falkon joint deployment code still seems screwy enough to not merge in the provider-deef log4j configuration, which is annoying. I'll have a look at that.

--

From benc at hawaga.org.uk Mon Mar 17 14:19:39 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 17 Mar 2008 19:19:39 +0000 (GMT)
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: <47DE676A.8090604@cs.uchicago.edu>
References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu>
Message-ID: 

On Mon, 17 Mar 2008, Ioan Raicu wrote:

> iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat
> log4j.properties.module
> log4j.logger.org.apache.axis.utils.JavaUtils=ERROR
> log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG
>
> I have this log4j property, do you have this enabled? This should
> enable more debug output from the Falkon provider.

If deploying swift with this command: ant -Dwith-provider-deef redist
then it looks like those lines don't get merged in. Mike, add those lines
yourself to your dist/swift-0.3-dev/etc/log4j.properties file.

--

From wilde at mcs.anl.gov Mon Mar 17 14:53:38 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 17 Mar 2008 14:53:38 -0500
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: 
References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu>
Message-ID: <47DECC42.4020500@mcs.anl.gov>

OK, thanks, will do.

My earlier message that "it worked" was premature - my sites file was doing local execution.

I pushed forward past a few other errors and am now stuck as follows.
As far as I can tell I'm now getting the "NFS not syncing" problem.

Swift creates the working dir for the workflow as a local file reference to an NFS-mounted directory. When swift tells falkon to run shared/wrapper.sh, it's not there yet. When I look after the workflow has failed, it is indeed there.

What I'd rather do here is tell swift to use scp rather than direct-file-access as the data provider. Do you know how to do that? Or are there any other data transports to consider? (One alternative is to get gridftp running on sico.)

On 3/17/08 2:19 PM, Ben Clifford wrote:
> On Mon, 17 Mar 2008, Ioan Raicu wrote:
>
>> iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat
>> log4j.properties.module
>> log4j.logger.org.apache.axis.utils.JavaUtils=ERROR
>> log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG
>>
>> I have this log4j property, do you have this enabled? This should
>> enable more debug output from the Falkon provider.
>
> If deploying swift with this command: ant -Dwith-provider-deef redist
> then it looks like those lines don't get merged in. Mike, add those lines
> yourself to your dist/swift-0.3-dev/etc/log4j.properties file.
>

From benc at hawaga.org.uk Mon Mar 17 14:57:07 2008
From: benc at hawaga.org.uk (Ben Clifford)
Date: Mon, 17 Mar 2008 19:57:07 +0000 (GMT)
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: <47DECC42.4020500@mcs.anl.gov>
References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov>
Message-ID: 

What does your filesystem layout look like?

Where are you running swift? And where are you putting your scicortex site directory? On an NFS that is also accessible from your submit machine? If so, what path?
--

From wilde at mcs.anl.gov Mon Mar 17 15:19:23 2008
From: wilde at mcs.anl.gov (Michael Wilde)
Date: Mon, 17 Mar 2008 15:19:23 -0500
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: 
References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov>
Message-ID: <47DED24B.8070900@mcs.anl.gov>

Sorry - another mis-diagnosis and incorrect conclusion on my part.

Zhao just told me that we have out of date falkon worker code on the sicortex that is not chdir'ing to the cwd arg of the falkon request.

That explains what I'm seeing. It's being fixed now and checked in.

--

To answer your questions though:

I'm running swift on a linux box, bblogin.mcs.anl.gov. It mounts the sicortex under /sicortex-homes, and I run swift from /sicortex-homes/wilde/amiga/run.

My sites file says:

/home/wilde/swiftwork

and /home/wilde/swiftwork on bblogin is a symlink to /sicortex-homes/wilde/swiftwork, so that when swift writes files to the sicortex dir (e.g. when it creates shared/*) it's using the same pathname that the worker side will use when the job runs. I.e., even though the mount points differ between the swift host and the worker host, symlinks make the workdir appear under the same name on both sides.

If NFS adheres to its close-to-open coherence semantics, this should, I think, work.

My scp-provider question is probably still worth answering and trying if this doesn't work.

- Mike

On 3/17/08 2:57 PM, Ben Clifford wrote:
> what does your filesystem layout look like?
>
> Where are you running swift? And where are you putting your
> scicortex site directory? On an NFS that is also accessible from your
> submit machine? If so, what path?
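The symlink arrangement described above can be checked in isolation: a file written through the link must resolve to the same underlying file as the one at the real mount path. A small sketch that recreates the two views in a local scratch directory (the path names mirror the layout above but everything here is created locally, purely for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SymlinkLayout {
    // "mount" stands in for /sicortex-homes/wilde/swiftwork (the real
    // NFS mount) and "link" for the /home/wilde/swiftwork symlink that
    // gives the submit host the same pathname the workers see.
    static boolean sameFileThroughLink(Path scratch) {
        try {
            Path mount = Files.createDirectories(
                    scratch.resolve("sicortex-homes/wilde/swiftwork"));
            Path link = scratch.resolve("swiftwork");
            Files.createSymbolicLink(link, mount);
            // Write through the symlinked name, as swift would when it
            // creates shared/wrapper.sh.
            Files.write(link.resolve("wrapper.sh"), "#!/bin/sh\n".getBytes());
            // Both names must refer to one underlying file.
            return Files.isSameFile(link.resolve("wrapper.sh"),
                    mount.resolve("wrapper.sh"));
        } catch (IOException e) {
            return false;
        }
    }
}
```

This only verifies the naming trick; it says nothing about NFS close-to-open coherence timing, which is the separate question raised above.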
>

From zhaozhang at uchicago.edu Mon Mar 17 15:24:54 2008
From: zhaozhang at uchicago.edu (Zhao Zhang)
Date: Mon, 17 Mar 2008 15:24:54 -0500
Subject: [Swift-devel] Re: swift-falkon problem
In-Reply-To: <47DED24B.8070900@mcs.anl.gov>
References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov>
Message-ID: <47DED396.4060304@uchicago.edu>

Hi,

The attachment is the BGexec source code to use with both non-swift and swift. To use it, first compile it with "gcc -o BGexec BGexec.c", then invoke it with "./BGexec 127.0.0.1 55000 55001 -debug swift". The last option indicates that we are running BGexec with swift; if not, simply change it to "no".

zhao

Michael Wilde wrote:
> Sorry - another mis-diagnosis and incorrect conclusion on my part.
>
> Zhao just told me that we have out of date falkon worker code on the
> sicortex that is not chdir'ing to the cwd arg of the falkon request.
>
> That explains what I'm seeing. It's being fixed now and checked in.
>
> --
>
> To answer your questions though:
>
> I'm running swift on a linux box bblogin.mcs.anl.gov
>
> It mounts the sicortex under /sicortex-homes
>
> I run swift from /sicortex-homes/wilde/amiga/run
>
> My sites file says:
>
> url="http://140.221.37.30:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/>
>
> /home/wilde/swiftwork
>
> and /home/wilde/swiftwork on bblogin is a symlink to
> /sicortex-homes/wilde/swiftwork
>
> so that when swift writes files to the sicortex dir (e.g. when it
> creates shared/*) it's using the same pathname that the worker-side
> will use when the job runs. I.e., even though the mount-points differ
> between the swift host and the worker host, symlinks make the workdir
> appear under the same name on both sides.
>
> If NFS adheres to its close-to-open-coherence semantics, this then
> should I think work.
>
> My scp-provider question is probably still worth answering and trying
> if this doesn't work.
> > - Mike > > > > > On 3/17/08 2:57 PM, Ben Clifford wrote: >> what does your filesystem layout look like? >> >> Where are you running swift? And where are you putting your scicortex >> site directory? On an NFS that is also accessible from your submit >> machine? If so, what path? >> > -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: BGexec.c URL: From iraicu at cs.uchicago.edu Mon Mar 17 15:26:39 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 17 Mar 2008 15:26:39 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DED396.4060304@uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DED396.4060304@uchicago.edu> Message-ID: <47DED3FF.8040105@cs.uchicago.edu> What does the last option "swift" really do? Is that the chdir? If yes, then it should be a default behavior (rather than an option) as long as the directory field is specified. Do you have SVN permissions to commit changes? If yes, you should commit this. If not, I'll commit it, and you need to get the permissions to commit code to the Falkon SVN (I'll work on this). Ioan Zhao Zhang wrote: > Hi, > > The attachment is the BGexec source code to use with both non-swift > and swift. To use it, first compile it like "gcc -o BGexec BGexec.c", > run invoke it, run "./BGexec 127.0.0.1 55000 55001 -debug swift" the > last option is to indicate that we are running BGexec with swift, if > not simply change it to no. > > zhao > > Michael Wilde wrote: >> Sorry - another mis-diagnosis and incorrect conclusion on my part. >> >> Zhao just told me that we have out of date falkon worker code on the >> sicortex that is not chdir'ing to the cwd arg of the falkon request. >> >> That explains what Im seeing. Its being fixed now and checked in. 
>> >> -- >> >> To answer your questions though: >> >> Im running swift on a linux box bblogin.mcs.anl.gov >> >> It mounts the sicortex under /sicortex-homes >> >> I run swift from /sicortex-homes/wilde/amiga/run >> >> My sites file says: >> >> >> >> > >> url="http://140.221.37.30:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/> >> >> /home/wilde/swiftwork >> >> >> and /home/wilde/swiftwork on bblogin is a symlink to >> /sicortex-homes/wilde/swiftwork >> >> so that when swift writes files to the sicortex dir (eg when it >> creates shared/*) its using the same pathname that the worker-side >> will use when the job runs. Ie, even though the mount-points differ >> between the swift host and the worker host, symlinks make the workdir >> appear under same name on both sides. >> >> If NFS adheres to its close-to-open-coherence semantics, this then >> should I think work. >> >> My scp-provider question is probably still worth answering and trying >> if this doesnt work. >> >> - Mike >> >> >> >> >> On 3/17/08 2:57 PM, Ben Clifford wrote: >>> what does your filesystem layout look like? >>> >>> Where are you running swift? And where are you putting your >>> scicortex site directory? On an NFS that is also accessible from >>> your submit machine? If so, what path? >>> >> -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From zhaozhang at uchicago.edu Mon Mar 17 15:34:06 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Mon, 17 Mar 2008 15:34:06 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DED3FF.8040105@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DED396.4060304@uchicago.edu> <47DED3FF.8040105@cs.uchicago.edu> Message-ID: <47DED5BE.50204@uchicago.edu> yep, the swift option only cares about "chdir". I am not sure that I have the SVN permissions to commit. That will be great, if you commit this to the SVN. :-) zhao Ioan Raicu wrote: > What does the last option "swift" really do? Is that the chdir? If > yes, then it should be a default behavior (rather than an option) as > long as the directory field is specified. Do you have SVN permissions > to commit changes? If yes, you should commit this. If not, I'll > commit it, and you need to get the permissions to commit code to the > Falkon SVN (I'll work on this). > > Ioan > > Zhao Zhang wrote: >> Hi, >> >> The attachment is the BGexec source code to use with both non-swift >> and swift. To use it, first compile it like "gcc -o BGexec BGexec.c", >> run invoke it, run "./BGexec 127.0.0.1 55000 55001 -debug swift" the >> last option is to indicate that we are running BGexec with swift, if >> not simply change it to no. >> >> zhao >> >> Michael Wilde wrote: >>> Sorry - another mis-diagnosis and incorrect conclusion on my part. 
>>> >>> Zhao just told me that we have out of date falkon worker code on the >>> sicortex that is not chdir'ing to the cwd arg of the falkon request. >>> >>> That explains what Im seeing. Its being fixed now and checked in. >>> >>> -- >>> >>> To answer your questions though: >>> >>> Im running swift on a linux box bblogin.mcs.anl.gov >>> >>> It mounts the sicortex under /sicortex-homes >>> >>> I run swift from /sicortex-homes/wilde/amiga/run >>> >>> My sites file says: >>> >>> >>> >>> >> >>> url="http://140.221.37.30:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/> >>> >>> /home/wilde/swiftwork >>> >>> >>> and /home/wilde/swiftwork on bblogin is a symlink to >>> /sicortex-homes/wilde/swiftwork >>> >>> so that when swift writes files to the sicortex dir (eg when it >>> creates shared/*) its using the same pathname that the worker-side >>> will use when the job runs. Ie, even though the mount-points differ >>> between the swift host and the worker host, symlinks make the >>> workdir appear under same name on both sides. >>> >>> If NFS adheres to its close-to-open-coherence semantics, this then >>> should I think work. >>> >>> My scp-provider question is probably still worth answering and >>> trying if this doesnt work. >>> >>> - Mike >>> >>> >>> >>> >>> On 3/17/08 2:57 PM, Ben Clifford wrote: >>>> what does your filesystem layout look like? >>>> >>>> Where are you running swift? And where are you putting your >>>> scicortex site directory? On an NFS that is also accessible from >>>> your submit machine? If so, what path? >>>> >>> > From benc at hawaga.org.uk Mon Mar 17 16:13:29 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 17 Mar 2008 21:13:29 +0000 (GMT) Subject: [Swift-devel] Swift v0.4 released Message-ID: Swift 0.4 is released. 
You can download it from http://www.ci.uchicago.edu/swift/downloads/ In addition, there are a few pages of release notes detailing the substantial changes since v0.3 here: http://www.ci.uchicago.edu/swift/packages/release-notes-0.4.txt -- From benc at hawaga.org.uk Mon Mar 17 17:01:53 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 17 Mar 2008 22:01:53 +0000 (GMT) Subject: [Swift-devel] google summer of code Message-ID: The Globus Alliance was accepted as a Google summer of code mentor organization. Under that umbrella, interested students can work on Swift-related projects. See http://dev.globus.org/wiki/Google_Summer_of_Code_2008_Ideas for more information - there are a few Swift-related projects listed there, but Google encourage students to also come up with their own. -- From hategan at mcs.anl.gov Mon Mar 17 17:27:52 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 17 Mar 2008 17:27:52 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE6C38.40802@cs.uchicago.edu> Message-ID: <1205792872.16095.3.camel@blabla.mcs.anl.gov> On Mon, 2008-03-17 at 19:16 +0000, Ben Clifford wrote: > On Mon, 17 Mar 2008, Ioan Raicu wrote: > > > > Failed to transfer wrapper log from amps1-20080316-1643-g4n8t252/info/0/sico > > > I don't understand this error, how is this error text being generated? Falkon > > only returns back a numeric exit code. Could this error be a post-processing > > error when Swift couldn't manipulate the local file system, or it couldn't > > find some expected files? What exit code does Falkon return for this task, > > 0, or something else? > > > That bit is a follow-on error, so pretty much ignore it - I haven't > figured out what the right thing to do is for presenting it to the user. > Basically: > > 1. swift tries to run an executable (using provider-deef, in this case) > 2. run of executable fails (i.e. provider-deef is passing back an error) > 3.
swift tries to stage back the wrapper log to help diagnosis. Should it perhaps be maybe(transfer(wrapper_log)) instead of transfer(wrapper_log)? > 4. the wrapper log doesn't exist (presumably the wrapper never executed > that far in the failed executable step 1) > 5. swift reports the above error/warning that step 3 failed. > > So that line is an error that is a follow-on from step 1 failing. > > > We need to figure out if the failure is in executing the tasks in Falkon, or > > if that is OK, and the error is in Swift not finding some files afterwards. > > Provider-deef is reporting that execution failed. So its not the second > one of those. But there isn't enough log information in mike's report to > indicate where below the swift/provider-deef interface the error is > occurring. > > The present swift+falkon joint deployment code still seems screwy enough > to not merge in the provider-deef log4j command, which is annoying. I'll > have a look at that. > From hategan at mcs.anl.gov Mon Mar 17 17:32:00 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 17 Mar 2008 17:32:00 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DECC42.4020500@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> Message-ID: <1205793120.16095.7.camel@blabla.mcs.anl.gov> On Mon, 2008-03-17 at 14:53 -0500, Michael Wilde wrote: > OK, thanks, will do. > > My earlier message that "it worked" was premature - my sites file was > doing local execution. > > I pushed forward past a few other errors and am now stuck as follows. > > As far as I can tell Im now getting the "NFS not syncing" problem. > > Swift creates the working dir for the workflow as a local file reference > to an NFS-mounted directory. When swift tells falkon to run > shared/wrapper.sh, its not there yet. When I look after the workflow > has failed, it is indeed there.
> > What I'd rather do here is tell swift to use scp rather than > direct-file-access as the data provider. Do you know how to do that? >From one of the working i2u2 sites.xml files: /sandbox/quarkcat/tmp You'll probably need to configure ~/.ssh/auth.defaults: www11.i2u2.org.type=key www11.i2u2.org.username=hategan www11.i2u2.org.key=/home/mike/.ssh/i2u2portal www11.i2u2.org.passphrase=... > Or > are there any other data transports to consider? > > (One alternative is to get gridftp running on sico). > > > > On 3/17/08 2:19 PM, Ben Clifford wrote: > > On Mon, 17 Mar 2008, Ioan Raicu wrote: > > > >> iraicu at viper:~/java/svn/cog/modules/provider-deef/etc> cat > >> log4j.properties.module > >> log4j.logger.org.apache.axis.utils.JavaUtils=ERROR > >> log4j.logger.org.globus.cog.abstraction.impl.execution.deef=DEBUG > >> > >> I have this log4j property, do you have this enabled? This should > >> enable more debug output from the Falkon provider. > > > > If deploying swift with this command: ant -Dwith-provider-deef redist > > then it looks like those lines don't get merged in. Mike, add those lines > > yourself to your dist/swift-0.3-dev/etc/log4j.properties file. > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Mar 17 20:48:19 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 01:48:19 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1205793120.16095.7.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <1205793120.16095.7.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 17 Mar 2008, Mihael Hategan wrote: > You'll probably need to configure ~/.ssh/auth.defaults: > www11.i2u2.org.passphrase=... ick passwords in config files. 
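Ben's objection to plaintext passphrases in config files is the standard one, and a common mitigation (short of moving to key agents) is to refuse to load a credentials file with permissive modes, the way OpenSSH rejects group/world-readable private keys. A hedged sketch in Python — `load_auth_defaults` and its behavior are illustrative only, not the actual cog ssh provider:

```python
import os
import stat

def load_auth_defaults(path):
    """Load key=value pairs from an auth.defaults-style file,
    refusing group- or world-accessible files.

    Illustrative sketch: mirrors OpenSSH's permission check on
    private keys; not the real cog/Swift ssh provider code.
    """
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(
            "%s must be accessible only by its owner (chmod 600)" % path)
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props
```

This does not remove the "ick" — the passphrase is still on disk — but it at least fails loudly when the file is readable by anyone else.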
-- From benc at hawaga.org.uk Mon Mar 17 20:57:50 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 01:57:50 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1205792872.16095.3.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE6C38.40802@cs.uchicago.edu> <1205792872.16095.3.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 17 Mar 2008, Mihael Hategan wrote: > Should it perhaps be maybe(transfer(wrapper_log)) instead of > transfer(wrapper_log)? In other circumstances, though, people get upset that their job status files didn't get transferred back but that there was no visible error. This same UI conflict occurs with kickstart if that is turned on, I think - failed jobs cause visible kickstart transfer errors. Its made more visible here because wrapper logs are always generated. -- From benc at hawaga.org.uk Tue Mar 18 01:27:46 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 06:27:46 +0000 (GMT) Subject: [Swift-devel] install directory change. read this - it will break your build. Message-ID: If you are an SVN user (rather than using mainline Swift releases downloaded in .tar.gz form), read the following and obey the one instruction it contains. The one instruction is: delete your cog/modules/vdsk/dist/ directory Further information about this one instruction which you must obey follows. You do not have to read this: I just committed a change to make the SVN version of swift be 'svn' rather than 0.3-dev. That means that swift will now build to: dist/vdsk-svn instead of dist/vdsk-0.3-dev When you do an svn update and an ant redist, subsequent builds will go into the above directory. However, if you already have a dist/vdsk-0.3-dev directory in place, it will be left there, with the previous version of swift there. This is almost definitely undesirable for you and you should delete that directory. My simplest advice is to remove the entire dist/ directory before making a rebuild.
If you do not delete this directory, you will almost definitely accidentally leave paths pointing at the old build directory, and you will therefore almost definitely experience confusion later on when new functionality and bugs do not appear, and old bugs do not disappear. All of the above references to 'almost definitely' come from experience the last times we've bumped the version number; they will almost definitely cause you trouble if you do not read and act on this mail. I've changed the in-SVN version to 'svn' on the basis that there is enough other information about SVN version numbers to render the '0.n-dev' string pretty useless and no longer worth the above-mentioned trouble every time a release is made. -- From benc at hawaga.org.uk Tue Mar 18 02:18:26 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 07:18:26 +0000 (GMT) Subject: ssh provider doc (was Re: [Swift-devel] Re: swift-falkon problem) In-Reply-To: <1205793120.16095.7.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <1205793120.16095.7.camel@blabla.mcs.anl.gov> Message-ID: I've put basically the below into the userguide in the sites.xml configuration section, alongside notes about using the other providers. On Mon, 17 Mar 2008, Mihael Hategan wrote: > >From one of the working i2u2 sites.xml files: > > > > /sandbox/quarkcat/tmp > > > You'll probably need to configure ~/.ssh/auth.defaults: > www11.i2u2.org.type=key > www11.i2u2.org.username=hategan > www11.i2u2.org.key=/home/mike/.ssh/i2u2portal > www11.i2u2.org.passphrase=...
-- From wilde at mcs.anl.gov Tue Mar 18 09:05:39 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 18 Mar 2008 09:05:39 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DED24B.8070900@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> Message-ID: <47DFCC33.30803@mcs.anl.gov> Moving forward on this: Zhao's update to the falkon worker agent "bgexec" fixed the problem of not finding wrapper.sh on the worker node. With the new bgexec in place, the workflow ran successfully for runs of 1 job and 25 jobs. In a run of 100 jobs I start to see problems: - 89 of 100 jobs produced output data files on shared/ - 89 info files, 60 success files - 29 output files made it back to the swift run directory (amdi.*) All the logs and the server-side runtime directory are on the CI FS at ~benc/swift-logs/wilde/run313 I am debugging this, but if you could take a look Ben that would be great. I will test the jobs locally to ensure that all 100 parameters yield successful output. But the app - a shell around a C program - should yield a zero-length file when the job fails and a single decimal number when it succeeds. This is still running with locally mounted NFS for data access. I will try the ssh approach after I rule out problems in my app. After mis-judging the previous problem as an NFS coherence issue, I dont want to be hasty in prejudging this one. - Mike On 3/17/08 3:19 PM, Michael Wilde wrote: > Sorry - another mis-diagnosis and incorrect conclusion on my part. > > Zhao just told me that we have out of date falkon worker code on the > sicortex that is not chdir'ing to the cwd arg of the falkon request. > > That explains what Im seeing. Its being fixed now and checked in. 
> > -- > > To answer your questions though: > > Im running swift on a linux box bblogin.mcs.anl.gov > > It mounts the sicortex under /sicortex-homes > > I run swift from /sicortex-homes/wilde/amiga/run > > My sites file says: > > > > > url="http://140.221.37.30:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/> > > /home/wilde/swiftwork > > > and /home/wilde/swiftwork on bblogin is a symlink to > /sicortex-homes/wilde/swiftwork > > so that when swift writes files to the sicortex dir (eg when it creates > shared/*) its using the same pathname that the worker-side will use when > the job runs. Ie, even though the mount-points differ between the swift > host and the worker host, symlinks make the workdir appear under same > name on both sides. > > If NFS adheres to its close-to-open-coherence semantics, this then > should I think work. > > My scp-provider question is probably still worth answering and trying if > this doesnt work. > > - Mike > > > > > On 3/17/08 2:57 PM, Ben Clifford wrote: >> what does your filesystem layout look like? >> >> Where are you running swift? And where are you putting your scicortex >> site directory? On an NFS that is also accessible from your submit >> machine? If so, what path? >> > From zhaozhang at uchicago.edu Tue Mar 18 12:00:26 2008 From: zhaozhang at uchicago.edu (Zhao Zhang) Date: Tue, 18 Mar 2008 12:00:26 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DFCC33.30803@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: <47DFF52A.4050103@uchicago.edu> Hi, Mike Are you running runam4? I think there is a range of the variable we chose, so in my test, I made sure, each input has an output. Not all input data will have results. 
zhao Michael Wilde wrote: > Moving forward on this: > > Zhao's update to the falkon worker agent "bgexec" fixed the problem of > not finding wrapper.sh on the worker node. > > With the new bgexec in place, the workflow ran successfully for runs > of 1 job and 25 jobs. > > In a run of 100 jobs I start to see problems: > > - 89 of 100 jobs produced output data files on shared/ > - 89 info files, 60 success files > - 29 output files made it back to the swift run directory > (amdi.*) > > All the logs and the server-side runtime directory are on the CI FS at > ~benc/swift-logs/wilde/run313 > > I am debugging this, but if you could take a look Ben that would be > great. > > I will test the jobs locally to ensure that all 100 parameters yield > successful output. But the app - a shell around a C program - should > yield a zero-length file when the job fails and a single decimal > number when it succeeds. > > This is still running with locally mounted NFS for data access. > I will try the ssh approach after I rule out problems in my app. > > After mis-judging the previous problem as an NFS coherence issue, I > dont want to be hasty in prejudging this one. > > - Mike > > > > On 3/17/08 3:19 PM, Michael Wilde wrote: >> Sorry - another mis-diagnosis and incorrect conclusion on my part. >> >> Zhao just told me that we have out of date falkon worker code on the >> sicortex that is not chdir'ing to the cwd arg of the falkon request. >> >> That explains what Im seeing. Its being fixed now and checked in. 
>> >> -- >> >> To answer your questions though: >> >> Im running swift on a linux box bblogin.mcs.anl.gov >> >> It mounts the sicortex under /sicortex-homes >> >> I run swift from /sicortex-homes/wilde/amiga/run >> >> My sites file says: >> >> >> >> > >> url="http://140.221.37.30:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService"/> >> >> /home/wilde/swiftwork >> >> >> and /home/wilde/swiftwork on bblogin is a symlink to >> /sicortex-homes/wilde/swiftwork >> >> so that when swift writes files to the sicortex dir (eg when it >> creates shared/*) its using the same pathname that the worker-side >> will use when the job runs. Ie, even though the mount-points differ >> between the swift host and the worker host, symlinks make the workdir >> appear under same name on both sides. >> >> If NFS adheres to its close-to-open-coherence semantics, this then >> should I think work. >> >> My scp-provider question is probably still worth answering and trying >> if this doesnt work. >> >> - Mike >> >> >> >> >> On 3/17/08 2:57 PM, Ben Clifford wrote: >>> what does your filesystem layout look like? >>> >>> Where are you running swift? And where are you putting your >>> scicortex site directory? On an NFS that is also accessible from >>> your submit machine? If so, what path? >>> >> > From benc at hawaga.org.uk Tue Mar 18 13:52:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 18:52:04 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47DFCC33.30803@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: On Tue, 18 Mar 2008, Michael Wilde wrote: > I will test the jobs locally to ensure that all 100 parameters yield > successful output. But the app - a shell around a C program - should yield a > zero-length file when the job fails and a single decimal number when it > succeeds. 
yes, please run exactly the same 100 parameter SwiftScript with the local provider. ideally twice or three times. -- From benc at hawaga.org.uk Tue Mar 18 13:55:37 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 18:55:37 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: if you want a more tested simple-but-large-numbers of jobs test app, get SwiftApps/badmonkey/ from the SVN. that's what I use. -- From benc at hawaga.org.uk Tue Mar 18 15:57:37 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 20:57:37 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: I picked the first failed job in the log you sent. Job id 2qbcdypi. I assume that your submit host and the various machines involved have properly synchronised clocks, but I have not checked this beyond seeing that the machine I am logged into has the same time as my laptop. I have labelled the times taken from different system clocks with lettered clock domains just in case they are different. For this job, its running in thread 0-1-88. The karajan level job submission goes through these states (in clock domain A): 23:14:08,196-0600 Submitting 23:14:08,204-0600 Submitted 23:14:14,121-0600 Active 23:14:14,121-0600 Completed Note that the last two - Active and Completed - are the same (within a millisecond). At 23:14:14,189-0600 Swift checks the job status and finds that the success file does not exist.
(This timestamp is in clock domain A) So now I look for the status file myself on the fd filesystem: $ ls --full-time /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success (this is in clock domain B) And see that the file does exist, but its timestamp is a full 5 seconds after the job was reported as successful by provider-deef. So now we can look in the info/ directory (next to the status directory) and get run timestamps of the jobs. According to the info log, the job begins running (in clock domain B again) at: 00:14:14.065373000-0500 which corresponds to within about 60ms of the time that provider-deef reported the job as active. However, the execution according to the wrapper log shows that the job did not finish executing until 00:14:19.233438000-0500 (which is when the status file is approximately timestamped). My off-the-cuff hypothesis is, based on the above, that somewhere in provider-deef or below, the execution system is reporting a job as completed as soon as it starts executing, rather than when it actually finishes executing; and that successes with small numbers of jobs have been a race condition that would disappear if those small jobs took a substantially longer time to execute (eg if they had a sleep 30s in them). -- From iraicu at cs.uchicago.edu Tue Mar 18 16:36:21 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 18 Mar 2008 16:36:21 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: <47E035D5.6000804@cs.uchicago.edu> I would say that for Falkon to send a successful exit code at the start of the execution is impossible (unless its a bug that I have never seen before)...
it could certainly send a failed exit code before the task even starts under certain conditions, but if an exit code of 0 is received at Swift, I would say that the task executed on the remote resource, and an exit code 0 was propagated back to Swift. Could a latency of NFS in which one node creates a file/dir and another node requires xxx time (in this case, 5 sec) before it actually sees the file, explain what Mike is seeing? If this is a likely explanation, then the race condition is that the exit code goes from worker to Falkon service to Swift faster than NFS can update its file/dir list, and when Swift checks for the file or dir (probably within 10s of milliseconds) of the job completion, it can't find the file/dir. Are there any counterarguments that would make this hypothesis not possible? Just another hypothesis which might be worth investigating. Ioan Ben Clifford wrote: > I picked the first failed job in the log oyu sent. Job id 2qbcdypi. > > I assume that your submit host and the various machines involved have > properly synchronised clocks, but I have not checked this beyond seeing > that the machine I am logged into has the same time as my laptop. I have > labelled the times taken from different system clocks with lettered clock > domains just in case they are different. > > For this job, its running in thread 0-1-88. > The karajan level job submission goes through these states (in clock > domain A) > 23:14:08,196-0600 Submitting > 23:14:08,204-0600 Submitted > 23:14:14,121-0600 Active > 23:14:14,121-0600 Completed > > Note that the last two - Active and Completed - are the same (within a > millisecond) > > At 23:14:14,189-0600 Swift checks the job status and finds the success > file is not found. 
(This timestamp is in clock domain A) > > So now I look at for the status file myself on the fd filesystem: > > $ ls --full-time > /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success > > -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 > /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success > > (this is in clock domain B) > > And see that the file does exist but is a full 5 seconds after the job was > reported as successful by provider-deef. > > So now we can look in the info/ directory (next to the status directory) > and get run time stamps or the jobs. > > According to the info log, the job begins running at: (in clock domain B > again) at: > > 00:14:14.065373000-0500 > > which corresponds within about 60ms of the time that provider-deef > reported the job as active. > However, the execution according to the wrapper log shows that the job did > not finish executing until > > 00:14:19.233438000-0500 > > (which is when the status file is approximately timestamped). > > My off-the-cuff hypothesis is, based on the above, that soemwhere in > provider-deef or below, the execution system is reporting a job as > completed as soon as it starts executing, rather than when it actually > finishes executing; and that successes with small numbers of jobs have > been a race condition that would disappear if those small jobs took a > substantially longer time to execute (eg if they had a sleep 30s in them). > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Tue Mar 18 16:45:56 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 21:45:56 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E035D5.6000804@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E035D5.6000804@cs.uchicago.edu> Message-ID: On Tue, 18 Mar 2008, Ioan Raicu wrote: > Could a latency of NFS in which one node creates a > file/dir and another node requires xxx time (in this case, 5 sec) before it > actually sees the file, explain what Mike is seeing? If this is a likely > explanation, then the race condition is that the exit code goes from worker to > Falkon service to Swift faster than NFS can update its file/dir list, and when > Swift checks for the file or dir (probably within 10s of milliseconds) of the > job completion, it can't find the file/dir. Are there any counterarguments > that would make this hypothesis not possible? Just another hypothesis which > might be worth investigating. > According to the timing in the log file, Swift is getting a notification from provider-deef that the job completed before the actual job has even been run to completion on the worker, well before the wrapper even attempts to write out a status file. I'm not accusing this of being a problem inside Falkon - I'm saying I think its happening somewhere below the Swift layer, so it could well be provider-deef, which is probably the most neglected part of this whole stack. 
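The cross-clock-domain arithmetic in Ben's analysis can be checked mechanically, because both timestamp formats carry numeric UTC offsets, so values from the two clock domains compare correctly once parsed as timezone-aware datetimes. A sketch (timestamps copied from the quoted log excerpts; the helper names are mine, not Swift's):

```python
from datetime import datetime

def parse_swift_ts(day, ts):
    # Karajan log stamps look like '23:14:14,121-0600'; `day` supplies
    # the date portion, which that log line format omits.
    return datetime.strptime(day + " " + ts, "%Y-%m-%d %H:%M:%S,%f%z")

def parse_ls_full_time(s):
    # `ls --full-time` prints nanoseconds; %f only accepts microseconds,
    # so truncate the fractional seconds to 6 digits before parsing.
    date, clock, tz = s.split()
    return datetime.strptime(date + " " + clock[:15] + " " + tz,
                             "%Y-%m-%d %H:%M:%S.%f %z")

# Clock domain A: provider-deef reports the job Completed.
completed = parse_swift_ts("2008-03-17", "23:14:14,121-0600")
# Clock domain B: mtime of the success file on the shared filesystem.
status_mtime = parse_ls_full_time("2008-03-18 00:14:19.202382966 -0500")
# Positive gap means the status file appeared AFTER the reported completion.
gap = (status_mtime - completed).total_seconds()
```

With these inputs the gap comes out to about 5.08 seconds, matching the "full 5 seconds" observation even though the two stamps were taken in different clock domains.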
Mike, are you running with those extra debug lines in the log4j configuration? If not, please run again with them turned on. Also Ioan can probably recommend which Falkon logs to keep so we can see what's happening for a job there and approach the problem from the other end of the stack too. -- From wilde at mcs.anl.gov Tue Mar 18 17:12:12 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 18 Mar 2008 17:12:12 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E035D5.6000804@cs.uchicago.edu> Message-ID: <47E03E3C.6060905@mcs.anl.gov> I will rerun with log4j settings. Will also try adding the sleep suggested earlier - to see if all jobs then fail. I did re-run the workflow 3X on local, and each time all 100 jobs finished successfully. Also for this dataset, all jobs return data. - Mike On 3/18/08 4:45 PM, Ben Clifford wrote: > On Tue, 18 Mar 2008, Ioan Raicu wrote: > >> Could a latency of NFS in which one node creates a >> file/dir and another node requires xxx time (in this case, 5 sec) before it >> actually sees the file, explain what Mike is seeing? If this is a likely >> explanation, then the race condition is that the exit code goes from worker to >> Falkon service to Swift faster than NFS can update its file/dir list, and when >> Swift checks for the file or dir (probably within 10s of milliseconds) of the >> job completion, it can't find the file/dir. Are there any counterarguments >> that would make this hypothesis not possible? Just another hypothesis which >> might be worth investigating. >> > > According to the timing in the log file, Swift is getting a notification > from provider-deef that the job completed before the actual job has even > been run to completion on the worker, well before the wrapper even > attempts to write out a status file. 
> > I'm not accusing this of being a problem inside Falkon - I'm saying I > think its happening somewhere below the Swift layer, so it could well be > provider-deef, which is probably the most neglected part of this whole > stack. > > Mike, are you running with those extra debug lines in the log4j > configuration? If not, please run again with them turned on. Also Ioan can > probably recommend which Falkon logs to keep so we can see what's > happening for a job there and approach the problem from the other end of > the stack too. > > From iraicu at cs.uchicago.edu Tue Mar 18 17:20:26 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 18 Mar 2008 17:20:26 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E035D5.6000804@cs.uchicago.edu> Message-ID: <47E0402A.8010303@cs.uchicago.edu> The clocks on the two machines that Mike was running on seem to be in sync (less than 1 sec off).
iraicu at bblogin:~/java/svn/falkon$ date Tue Mar 18 17:10:15 CDT 2008 iraicu at scx-m23n6 ~/java/svn/falkon/worker/temp $ date Tue Mar 18 17:10:15 CDT 2008 Mike, here are the logs you need to make sure you capture when running in debug mode: iraicu at viper:~/java/svn/falkon/config> cat Falkon-TCPCore.config GenericPortalWS=falkon_task_submission_history.txt GenericPortalWS_perf_per_sec=falkon_summary.txt GenericPortalWS_taskPerf=falkon_task_perf.txt GenericPortalWS_task=falkon_task_status.txt When running in normal mode (when we know things work fine), we just need iraicu at viper:~/java/svn/falkon/config> cat Falkon-TCPCore.config GenericPortalWS_perf_per_sec=falkon_summary.txt GenericPortalWS_taskPerf=falkon_task_perf.txt In the event that we can't figure out things from the Swift and Falkon service logs, we might have to enable worker side logs as well, which you do from the run.worker-c.sh (or run.worker-c-ram.sh) script(s). Its also possible that the Falkon provider code is doing something funny, but I'd want to see the Falkon logs before we focus on the provider. Ioan Ben Clifford wrote: > On Tue, 18 Mar 2008, Ioan Raicu wrote: > > >> Could a latency of NFS in which one node creates a >> file/dir and another node requires xxx time (in this case, 5 sec) before it >> actually sees the file, explain what Mike is seeing? If this is a likely >> explanation, then the race condition is that the exit code goes from worker to >> Falkon service to Swift faster than NFS can update its file/dir list, and when >> Swift checks for the file or dir (probably within 10s of milliseconds) of the >> job completion, it can't find the file/dir. Are there any counterarguments >> that would make this hypothesis not possible? Just another hypothesis which >> might be worth investigating. 
>> >> > > According to the timing in the log file, Swift is getting a notification > from provider-deef that the job completed before the actual job has even > been run to completion on the worker, well before the wrapper even > attempts to write out a status file. > > I'm not accusing this of being a problem inside Falkon - I'm saying I > think its happening somewhere below the Swift layer, so it could well be > provider-deef, which is probably the most neglected part of this whole > stack. > > Mike, are you running with those extra debug lines in the log4j > configuration? If not, please run again with them turned on. Also Ioan can > probably recommend which Falkon logs to keep so we can see what's > happening for a job there and approach the problem from the other end of > the stack too. > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Tue Mar 18 17:29:01 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 18 Mar 2008 17:29:01 -0500 Subject: [Swift-devel] Why build prompts in redist? Message-ID: <47E0422D.9000707@mcs.anl.gov> When doing an ant redist, I get: dist.dir.warning: ====================================================================================== [input] Warning! The specified target directory (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) does not seem to contain a Swift build. 
[input] Press Return to continue with the build or CTRL+C to abort... [input] ======================================================================================

Is this check really useful? It's inconvenient when you start a build and then walk away from it. You come back and it's waiting on this prompt.

From benc at hawaga.org.uk Tue Mar 18 17:37:01 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 18 Mar 2008 22:37:01 +0000 (GMT) Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: <47E0422D.9000707@mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> Message-ID:

> Is this check really useful? It's inconvenient when you start a build and then
> walk away from it. You come back and it's waiting on this prompt.

It used to be very useful. If you are building with the instructions in provider-deef/README (note that these were updated in the last week or so), then no, it isn't useful.

-- From benc at hawaga.org.uk Tue Mar 18 20:40:12 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 01:40:12 +0000 (GMT) Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: <47E0422D.9000707@mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> Message-ID:

> [input] Warning! The specified target directory
> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn)
> does not seem to contain a Swift build.
> [input] Press Return to continue with the build or CTRL+C to abort...

As of r1738 (to provider-deef) this does not happen any more.
-- From hategan at mcs.anl.gov Wed Mar 19 03:25:16 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 19 Mar 2008 03:25:16 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: <1205915116.21170.1.camel@blabla.mcs.anl.gov>

On Tue, 2008-03-18 at 20:57 +0000, Ben Clifford wrote:
> I picked the first failed job in the log you sent. Job id 2qbcdypi.
>
> I assume that your submit host and the various machines involved have
> properly synchronised clocks, but I have not checked this beyond seeing
> that the machine I am logged into has the same time as my laptop. I have
> labelled the times taken from different system clocks with lettered clock
> domains just in case they are different.
>
> For this job, it's running in thread 0-1-88.
> The karajan-level job submission goes through these states (in clock
> domain A):
> 23:14:08,196-0600 Submitting
> 23:14:08,204-0600 Submitted
> 23:14:14,121-0600 Active
> 23:14:14,121-0600 Completed
>
> Note that the last two - Active and Completed - are the same (within a
> millisecond)

That probably means the provider doesn't really set the active state, and it gets filled in when "completed" arrives.
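[The zero-width Active-to-Completed gap discussed above can be checked for mechanically. The sketch below is illustrative only: the timestamp format is taken from the karajan-level log excerpts in this thread, and suspicious_transition is a hypothetical helper, not part of Swift or Falkon.]

```python
from datetime import datetime

# Timestamp format as it appears in the log excerpts above,
# e.g. "23:14:14,121-0600" (time, comma, milliseconds, UTC offset).
FMT = "%H:%M:%S,%f%z"

def parse(ts):
    return datetime.strptime(ts, FMT)

def suspicious_transition(active_ts, completed_ts, threshold_ms=5):
    """Flag jobs whose Active->Completed gap is implausibly small,
    which suggests Active was never reported by the provider and was
    synthesized when Completed arrived."""
    delta = parse(completed_ts) - parse(active_ts)
    return abs(delta.total_seconds() * 1000) < threshold_ms

# The job from this thread: Active and Completed within a millisecond.
print(suspicious_transition("23:14:14,121-0600", "23:14:14,121-0600"))  # True
# Submitted -> Completed, by contrast, is several seconds apart.
print(suspicious_transition("23:14:08,204-0600", "23:14:14,121-0600"))  # False
```

[Running this over all jobs in a run log would show whether every job, or only the failed ones, has the synthesized Active state.]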
From hategan at mcs.anl.gov Wed Mar 19 03:25:50 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 19 Mar 2008 03:25:50 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E035D5.6000804@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E035D5.6000804@cs.uchicago.edu> Message-ID: <1205915150.21170.3.camel@blabla.mcs.anl.gov> On Tue, 2008-03-18 at 16:36 -0500, Ioan Raicu wrote: > I would say that Falkon to send a successful exit code at the start of > the execution is impossible (unless its a bug that I have never seen > before)... :) Like any new bug? > Ioan > > > Ben Clifford wrote: > > I picked the first failed job in the log oyu sent. Job id 2qbcdypi. > > > > I assume that your submit host and the various machines involved have > > properly synchronised clocks, but I have not checked this beyond seeing > > that the machine I am logged into has the same time as my laptop. I have > > labelled the times taken from different system clocks with lettered clock > > domains just in case they are different. > > > > For this job, its running in thread 0-1-88. > > The karajan level job submission goes through these states (in clock > > domain A) > > 23:14:08,196-0600 Submitting > > 23:14:08,204-0600 Submitted > > 23:14:14,121-0600 Active > > 23:14:14,121-0600 Completed > > > > Note that the last two - Active and Completed - are the same (within a > > millisecond) > > > > At 23:14:14,189-0600 Swift checks the job status and finds the success > > file is not found. 
(This timestamp is in clock domain A)
> >
> > So now I look for the status file myself on the fd filesystem:
> >
> > $ ls --full-time
> > /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
> >
> > -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500
> > /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
> >
> > (this is in clock domain B)
> >
> > And see that the file does exist but is a full 5 seconds after the job was
> > reported as successful by provider-deef.
> >
> > So now we can look in the info/ directory (next to the status directory)
> > and get run time stamps of the jobs.
> >
> > According to the info log, the job begins running (in clock domain B
> > again) at:
> >
> > 00:14:14.065373000-0500
> >
> > which corresponds within about 60ms of the time that provider-deef
> > reported the job as active.
> > However, the execution according to the wrapper log shows that the job did
> > not finish executing until
> >
> > 00:14:19.233438000-0500
> >
> > (which is when the status file is approximately timestamped).
> >
> > My off-the-cuff hypothesis is, based on the above, that somewhere in
> > provider-deef or below, the execution system is reporting a job as
> > completed as soon as it starts executing, rather than when it actually
> > finishes executing; and that successes with small numbers of jobs have
> > been a race condition that would disappear if those small jobs took a
> > substantially longer time to execute (eg if they had a sleep 30s in them).
> >
>
> --
> ===================================================
> Ioan Raicu
> Ph.D. Candidate
> ===================================================
> Distributed Systems Laboratory
> Computer Science Department
> University of Chicago
> 1100 E.
58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Wed Mar 19 03:33:00 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 08:33:00 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1205915116.21170.1.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <1205915116.21170.1.camel@blabla.mcs.anl.gov> Message-ID: On Wed, 19 Mar 2008, Mihael Hategan wrote: > > Note that the last two - Active and Completed - are the same (within a > > millisecond) > > That probably means the provider doesn't really set the active state, > and it gets filled in when "completed" arrives. Indeed the provider doesn't set Active anywhere. But the time of the above events is still many seconds too early. -- From iraicu at cs.uchicago.edu Wed Mar 19 06:15:27 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 19 Mar 2008 06:15:27 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1205915116.21170.1.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <1205915116.21170.1.camel@blabla.mcs.anl.gov> Message-ID: <47E0F5CF.1090000@cs.uchicago.edu> Right, from what I remember, it never sets the active state. 
The jobs in question probably took less than 1 sec to execute, so seeing 8 seconds between submitted and completed looks fine to me. The fact that the timestamps on the file/dir are later than the time Falkon says the job completed is an indication that either the clocks are not in sync (bblogin and fd-login.mcs are in sync, but what about bblogin and the SiCortex compute nodes?), or NFS did not process the write operation immediately, and under the heavy load of 60 workers all writing at the same time, it took 5 seconds to complete the write operation. Mike, where are the Falkon logs, so we can see what happened from Falkon's point of view?

Ioan

Mihael Hategan wrote:
> On Tue, 2008-03-18 at 20:57 +0000, Ben Clifford wrote:
>
>> I picked the first failed job in the log you sent. Job id 2qbcdypi.
>>
>> I assume that your submit host and the various machines involved have
>> properly synchronised clocks, but I have not checked this beyond seeing
>> that the machine I am logged into has the same time as my laptop. I have
>> labelled the times taken from different system clocks with lettered clock
>> domains just in case they are different.
>>
>> For this job, it's running in thread 0-1-88.
>> The karajan-level job submission goes through these states (in clock
>> domain A):
>> 23:14:08,196-0600 Submitting
>> 23:14:08,204-0600 Submitted
>> 23:14:14,121-0600 Active
>> 23:14:14,121-0600 Completed
>>
>> Note that the last two - Active and Completed - are the same (within a
>> millisecond)
>>
>
> That probably means the provider doesn't really set the active state,
> and it gets filled in when "completed" arrives.
>
>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

-- ===================================================
Ioan Raicu
Ph.D.
Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

From wilde at mcs.anl.gov Wed Mar 19 10:45:50 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 19 Mar 2008 10:45:50 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> Message-ID: <47E1352E.3070808@mcs.anl.gov>

Following up on Ben's request from the msg below:

> My off-the-cuff hypothesis is, based on the above, that somewhere in
> provider-deef or below, the execution system is reporting a job as
> completed as soon as it starts executing, rather than when it actually
> finishes executing; and that successes with small numbers of jobs have
> been a race condition that would disappear if those small jobs took a
> substantially longer time to execute (eg if they had a sleep 30s in them).

I tested the following:

run314: 100 jobs, 10 workers: all finished OK
run315: 100 jobs, 110 workers: ~80% failed
run316: 100 jobs, 110 workers, sleep 30 in the app: all finished OK

These are in ~benc/swift-logs/wilde. The workdirs are preserved on bblogin/sico - I did not copy them because you need access to the msec timestamps anyways.

I can run these several times each to get more data before we assess the hypothesis, but didn't have time yet. Let me know if that's needed.
I'm cautiously leaning a bit more to the NFS-race theory. I would like to test with scp data transfer. Am also trying to get gridftp compiled there with help from Raj. The build is failing with gpt problems; I think I need Ben or Charles on this.

- Mike

On 3/18/08 3:57 PM, Ben Clifford wrote:
> I picked the first failed job in the log you sent. Job id 2qbcdypi.
>
> I assume that your submit host and the various machines involved have
> properly synchronised clocks, but I have not checked this beyond seeing
> that the machine I am logged into has the same time as my laptop. I have
> labelled the times taken from different system clocks with lettered clock
> domains just in case they are different.
>
> For this job, it's running in thread 0-1-88.
> The karajan-level job submission goes through these states (in clock
> domain A):
> 23:14:08,196-0600 Submitting
> 23:14:08,204-0600 Submitted
> 23:14:14,121-0600 Active
> 23:14:14,121-0600 Completed
>
> Note that the last two - Active and Completed - are the same (within a
> millisecond)
>
> At 23:14:14,189-0600 Swift checks the job status and finds the success
> file is not found. (This timestamp is in clock domain A)
>
> So now I look for the status file myself on the fd filesystem:
>
> $ ls --full-time
> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
>
> -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500
> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success
>
> (this is in clock domain B)
>
> And see that the file does exist but is a full 5 seconds after the job was
> reported as successful by provider-deef.
>
> So now we can look in the info/ directory (next to the status directory)
> and get run time stamps of the jobs.
>
> According to the info log, the job begins running (in clock domain B
> again) at:
>
> 00:14:14.065373000-0500
>
> which corresponds within about 60ms of the time that provider-deef
> reported the job as active.
> However, the execution according to the wrapper log shows that the job did
> not finish executing until
>
> 00:14:19.233438000-0500
>
> (which is when the status file is approximately timestamped).
>
> My off-the-cuff hypothesis is, based on the above, that somewhere in
> provider-deef or below, the execution system is reporting a job as
> completed as soon as it starts executing, rather than when it actually
> finishes executing; and that successes with small numbers of jobs have
> been a race condition that would disappear if those small jobs took a
> substantially longer time to execute (eg if they had a sleep 30s in them).
>

From benc at hawaga.org.uk Wed Mar 19 11:31:18 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 16:31:18 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E1352E.3070808@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> Message-ID:

On Wed, 19 Mar 2008, Michael Wilde wrote:

> run315: 100 jobs, 110 workers: ~80% failed

Do you have the falkon logs for this run?

-- From benc at hawaga.org.uk Wed Mar 19 11:53:39 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 16:53:39 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E1352E.3070808@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> Message-ID:

A brief look at run315 shows much closer/overlapping times that are more indicative of something funny at the filesystem level than yesterday's logs.

In run316, where did you put the sleep? In the application code or in the wrapper script?
-- From iraicu at cs.uchicago.edu Wed Mar 19 12:09:25 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Wed, 19 Mar 2008 12:09:25 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E1352E.3070808@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> Message-ID: <47E148C5.3090600@cs.uchicago.edu> Mike, The errors that occur are after the jobs execute, and Swift looks for the successful empty file for each job (in a unique dir per job), right? If Swift were to rely solely on the exit code that Falkon returned for each job, would that not solve your immediate problem? That is not to say that a race condition might not happen elsewhere, but at least it would not happen in such a simple scenario, where you have no data dependencies. For example, if job A (running on compute node X) outputs data 1, and job B (running on compute node Y) reads data 1, and Swift submits job B within milliseconds of job A's completion, its likely that job B might not find data 1 to read. So, relying solely on Falkon exit codes could allow you to run your 100 jobs that have no data dependencies among each other just fine, and will push the race condition to workflows that do have some data dependencies between jobs. Ben, Mihael, is this feasible, to use the Falkon exit code solely to determine the success or failure of a job? 
Ioan Michael Wilde wrote: > Following up on Ben's request from the msg below: > > > My off-the-cuff hypothesis is, based on the above, that soemwhere in > > provider-deef or below, the execution system is reporting a job as > > completed as soon as it starts executing, rather than when it actually > > finishes executing; and that successes with small numbers of jobs have > > been a race condition that would disappear if those small jobs took a > > substantially longer time to execute (eg if they had a sleep 30s in > them). > > > > I tested the following: > > run314: 100 jobs, 10 workers: all finished OK > > run315: 100 jobs, 110 workers: ~80% failed > > run316: 100 jobs, 110 workers, sleep 30 in the app: all finished OK > > These are in ~benc/swift-logs/wilde. The workdirs are preserved on > bblogin/sico - I did not copy them because you need access to the msec > timestamps anyways. > > I can run these several times each to get more data before we assess > the hypothesis, but didnt have time yet. Let me know if thats needed. > > I'm cautiously leaning a bit more to the NFS-race theory. I would like > to test with scp data transfer. Am also trying to get gridftp > compiled there with help from Raj. Build is failing with gpt > problems, I think I need Ben or Charles on this. > > - Mike > > > On 3/18/08 3:57 PM, Ben Clifford wrote: >> I picked the first failed job in the log oyu sent. Job id 2qbcdypi. >> >> I assume that your submit host and the various machines involved have >> properly synchronised clocks, but I have not checked this beyond >> seeing that the machine I am logged into has the same time as my >> laptop. I have labelled the times taken from different system clocks >> with lettered clock domains just in case they are different. >> >> For this job, its running in thread 0-1-88. 
>> The karajan level job submission goes through these states (in clock >> domain A) >> 23:14:08,196-0600 Submitting >> 23:14:08,204-0600 Submitted >> 23:14:14,121-0600 Active >> 23:14:14,121-0600 Completed >> >> Note that the last two - Active and Completed - are the same (within >> a millisecond) >> >> At 23:14:14,189-0600 Swift checks the job status and finds the >> success file is not found. (This timestamp is in clock domain A) >> >> So now I look at for the status file myself on the fd filesystem: >> >> $ ls --full-time >> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success >> >> -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 >> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success >> >> >> (this is in clock domain B) >> >> And see that the file does exist but is a full 5 seconds after the >> job was reported as successful by provider-deef. >> >> So now we can look in the info/ directory (next to the status >> directory) and get run time stamps or the jobs. >> >> According to the info log, the job begins running at: (in clock >> domain B again) at: >> >> 00:14:14.065373000-0500 >> >> which corresponds within about 60ms of the time that provider-deef >> reported the job as active. >> However, the execution according to the wrapper log shows that the >> job did not finish executing until >> >> 00:14:19.233438000-0500 >> >> (which is when the status file is approximately timestamped). >> >> My off-the-cuff hypothesis is, based on the above, that soemwhere in >> provider-deef or below, the execution system is reporting a job as >> completed as soon as it starts executing, rather than when it >> actually finishes executing; and that successes with small numbers of >> jobs have been a race condition that would disappear if those small >> jobs took a substantially longer time to execute (eg if they had a >> sleep 30s in them). 
>>
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>

-- ===================================================
Ioan Raicu
Ph.D. Candidate
===================================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
===================================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dev.globus.org/wiki/Incubator/Falkon
http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page
===================================================
===================================================

From benc at hawaga.org.uk Wed Mar 19 12:20:39 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 17:20:39 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E148C5.3090600@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E148C5.3090600@cs.uchicago.edu> Message-ID:

On Wed, 19 Mar 2008, Ioan Raicu wrote:

> Ben, Mihael, is this feasible, to use the Falkon exit code solely to determine
> the success or failure of a job?

It would be overspecialising for this particular case, I think; and it doesn't solve whatever the fundamental problem is (which, having seen today's results, I now think I probably agree with you is a filesystem race / bad filesystem semantics).
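[If the status file really is subject to an NFS propagation lag, one reader-side mitigation is a bounded poll rather than a single immediate existence check. The sketch below is illustrative only: Swift's actual check lives in the Java/Karajan stack, and wait_for_status_file is a hypothetical helper, not part of any of the code discussed in this thread.]

```python
import os
import time

def wait_for_status_file(path, timeout_s=10.0, poll_s=0.2):
    """Poll for a job's success-status file instead of checking once.

    A bounded retry like this would mask a close-to-open propagation
    lag between the worker node that touches the file and the submit
    host that checks for it, at the cost of up to timeout_s extra
    latency when the job genuinely failed to produce the file.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_s)
    return os.path.exists(path)  # one last look before giving up

# A worker-side writer that creates the file a fraction of a second
# after the completion notification would be picked up by the poll
# rather than reported as a missing status file.
```

[As noted in this thread, this only hides the race; it does not fix whatever coherence problem causes the file to be invisible in the first place.]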
From benc at hawaga.org.uk Wed Mar 19 12:49:59 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 17:49:59 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E1352E.3070808@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> Message-ID:

On Wed, 19 Mar 2008, Michael Wilde wrote:

> I'm cautiously leaning a bit more to the NFS-race theory. I would like to test
> with scp data transfer. Am also trying to get gridftp compiled there with
> help from Raj. Build is failing with gpt problems, I think I need Ben or
> Charles on this.

If an underlying NFS race is the problem, using scp or gridftp won't cure that - it may, by virtue of adding latency, make the problem disappear most/all of the time, but that would be by virtue of slowing down access, not any actual fixing of the problem.

If you're deliberately introducing artificial delays, eg by doing the above, there are probably simpler ways (such as hacking a delay into the wrapper script after doing the touch but before exiting).

From wilde at mcs.anl.gov Wed Mar 19 13:26:47 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 19 Mar 2008 13:26:47 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> Message-ID: <47E15AE7.2060609@mcs.anl.gov>

On 3/19/08 11:53 AM, Ben Clifford wrote:
> a brief look at run315 shows much closer/overlapping times that are more
> indicative of something funny at the filesystem level than yesterdays
> logs.
>
> in run316, where did you put the sleep? in the application code or in the
> wrapper script?
>

The sleep was the very last statement in the runam3-sleep30 wrapper script.
It's the executable listed in tc.data.

From wilde at mcs.anl.gov Wed Mar 19 13:46:02 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 19 Mar 2008 13:46:02 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> Message-ID: <47E15F6A.5080109@mcs.anl.gov>

I was not considering scp and gridftp to introduce artificial delays. The purpose was two-fold:

1) eliminate the need to run swift on a host that mounts the sicortex filesystem, as there is no good host that does on which we can run long-term (we are temporary guests on bblogin). This was the initial reason, before we knew of any problems.

2) for dealing with this race, I thought we could avoid any possible NFS race conditions by writing directly to the filesystem. But I now realize that this won't necessarily help: the scp and gridftp *servers* would not be running on a host that locally mounts the filesystem, and the sicortex worker nodes do NFS mounts themselves.

My (likely outdated) understanding of the NFS protocol was that it's supposed to guarantee close-to-open coherence. Meaning that if two clients want to access a file sequentially, and the writing client closes the file before the reading client opens the file, then NFS is supposed to ensure that the reader correctly sees the existence and content of the file.

If others agree that this should still be the case, then it's worth looking at our code to make sure that this is the case. If it weren't, you'd think that more things would break, but perhaps Falkon exacerbates any problems in that area due to its low latency.

The race as far as I know is between the worker writing and moving result, info, and success status files, and the swift host seeing these, correct?
- Mike On 3/19/08 12:49 PM, Ben Clifford wrote: > On Wed, 19 Mar 2008, Michael Wilde wrote: > >> I'm cautiously leaning a bit more to the NFS-race theory. I would like to test >> with scp data transfer. Am also trying to get gridftp compiled there with >> help from Raj. Build is failing with gpt problems, I think I need Ben or >> Charles on this. > > If an underlying NFS race is the problem, using scp or gridftp won't cure > that - it may, by virtue of adding latency, make the problem disppear > most/all of the time, but that would be by virtue of slowing down access, > not any actual fixing of the problem. > > If you're deliberately introducing artificial delays eg by doing the > above, there are probably simpler ways (such as hacking a delay into the > wrapper script after doing the touch but before exiting) > From hategan at mcs.anl.gov Wed Mar 19 15:48:57 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 19 Mar 2008 15:48:57 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E148C5.3090600@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E148C5.3090600@cs.uchicago.edu> Message-ID: <1205959738.3410.6.camel@blabla.mcs.anl.gov> If there is a race condition, we need to find it and address it, because it applies to more than just the status file. Mihael On Wed, 2008-03-19 at 12:09 -0500, Ioan Raicu wrote: > Mike, > The errors that occur are after the jobs execute, and Swift looks for > the successful empty file for each job (in a unique dir per job), > right? If Swift were to rely solely on the exit code that Falkon > returned for each job, would that not solve your immediate problem? > That is not to say that a race condition might not happen elsewhere, but > at least it would not happen in such a simple scenario, where you have > no data dependencies. 
For example, if job A (running on compute node X) > outputs data 1, and job B (running on compute node Y) reads data 1, and > Swift submits job B within milliseconds of job A's completion, its > likely that job B might not find data 1 to read. So, relying solely on > Falkon exit codes could allow you to run your 100 jobs that have no data > dependencies among each other just fine, and will push the race > condition to workflows that do have some data dependencies between jobs. > > Ben, Mihael, is this feasible, to use the Falkon exit code solely to > determine the success or failure of a job? > > Ioan > > Michael Wilde wrote: > > Following up on Ben's request from the msg below: > > > > > My off-the-cuff hypothesis is, based on the above, that soemwhere in > > > provider-deef or below, the execution system is reporting a job as > > > completed as soon as it starts executing, rather than when it actually > > > finishes executing; and that successes with small numbers of jobs have > > > been a race condition that would disappear if those small jobs took a > > > substantially longer time to execute (eg if they had a sleep 30s in > > them). > > > > > > > I tested the following: > > > > run314: 100 jobs, 10 workers: all finished OK > > > > run315: 100 jobs, 110 workers: ~80% failed > > > > run316: 100 jobs, 110 workers, sleep 30 in the app: all finished OK > > > > These are in ~benc/swift-logs/wilde. The workdirs are preserved on > > bblogin/sico - I did not copy them because you need access to the msec > > timestamps anyways. > > > > I can run these several times each to get more data before we assess > > the hypothesis, but didnt have time yet. Let me know if thats needed. > > > > I'm cautiously leaning a bit more to the NFS-race theory. I would like > > to test with scp data transfer. Am also trying to get gridftp > > compiled there with help from Raj. Build is failing with gpt > > problems, I think I need Ben or Charles on this. 
> > > > - Mike > > > > > > On 3/18/08 3:57 PM, Ben Clifford wrote: > >> I picked the first failed job in the log oyu sent. Job id 2qbcdypi. > >> > >> I assume that your submit host and the various machines involved have > >> properly synchronised clocks, but I have not checked this beyond > >> seeing that the machine I am logged into has the same time as my > >> laptop. I have labelled the times taken from different system clocks > >> with lettered clock domains just in case they are different. > >> > >> For this job, its running in thread 0-1-88. > >> The karajan level job submission goes through these states (in clock > >> domain A) > >> 23:14:08,196-0600 Submitting > >> 23:14:08,204-0600 Submitted > >> 23:14:14,121-0600 Active > >> 23:14:14,121-0600 Completed > >> > >> Note that the last two - Active and Completed - are the same (within > >> a millisecond) > >> > >> At 23:14:14,189-0600 Swift checks the job status and finds the > >> success file is not found. (This timestamp is in clock domain A) > >> > >> So now I look at for the status file myself on the fd filesystem: > >> > >> $ ls --full-time > >> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success > >> > >> -rw-r--r-- 1 wilde mcsz 0 2008-03-18 00:14:19.202382966 -0500 > >> /home/wilde/swiftwork/amps1-20080317-2314-nvn8x1p2/status/2/runam3-2pbcdypi-success > >> > >> > >> (this is in clock domain B) > >> > >> And see that the file does exist but is a full 5 seconds after the > >> job was reported as successful by provider-deef. > >> > >> So now we can look in the info/ directory (next to the status > >> directory) and get run time stamps or the jobs. > >> > >> According to the info log, the job begins running at: (in clock > >> domain B again) at: > >> > >> 00:14:14.065373000-0500 > >> > >> which corresponds within about 60ms of the time that provider-deef > >> reported the job as active. 
> >> However, the execution according to the wrapper log shows that the > >> job did not finish executing until > >> > >> 00:14:19.233438000-0500 > >> > >> (which is when the status file is approximately timestamped). > >> > >> My off-the-cuff hypothesis is, based on the above, that somewhere in > >> provider-deef or below, the execution system is reporting a job as > >> completed as soon as it starts executing, rather than when it > >> actually finishes executing; and that successes with small numbers of > >> jobs have been a race condition that would disappear if those small > >> jobs took a substantially longer time to execute (e.g. if they had a > >> sleep 30s in them). > >> > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > From benc at hawaga.org.uk Wed Mar 19 16:22:46 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 19 Mar 2008 21:22:46 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E15F6A.5080109@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> Message-ID: On Wed, 19 Mar 2008, Michael Wilde wrote: > My (likely outdated) understanding of the NFS protocol was that it's supposed to > guarantee close-to-open coherence. Meaning that if two clients want to access > a file sequentially, and the writing client closes the file before the reading > client opens the file, then NFS was supposed to ensure that the reader > correctly saw the existence and content of the file. Right.
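The close-to-open contract described here reduces to a simple ordering: the writer's close must complete before the reader's open begins. A minimal sketch of that ordering (file names and paths are made up for illustration; on a single host this trivially holds, and a coherent pair of NFS clients is supposed to give the same result):

```shell
# Hypothetical illustration of close-to-open ordering; paths are made up.
tmp=$(mktemp -d)

# Writer side: write the data and close the file (the redirection opens,
# writes, and closes it before the next command runs).
printf 'data 1\n' > "$tmp/output"

# Reader side: open strictly after the writer's close.  Under
# close-to-open coherence this must see the complete content.
cat "$tmp/output"

rm -r "$tmp"
```

Locally this always prints the full content; the failures in this thread suggest that between two NFS clients with attribute caching, the reader's open can still miss a file the writer has already closed.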
Linux NFS (but this is going back half a decade) had some problem there (I think that caused problems for GRAM2 somewhere, for example) though I do not remember the details; and it was also half a decade ago so has a good chance of being different now. A quick Google did not find anything that immediately applied. I've also still not entirely ruled out a race somewhere in the falkon->provider-deef->swift stack reporting this. > If others agree that this should still be the case, then it's worth > looking at our code to make sure that this is the case. If it wasn't, > you'd think that more things would break, but perhaps Falkon exacerbates > any problems in that area due to its low latency. Indeed, the combination of falkon and local filesystem access is probably getting the time between touching the status file on one node and reading it on another down pretty low compared to other submission and file access protocols. > The race as far as I know is between the worker writing and moving result, > info, and success status files, and the swift host seeing these, correct? That's what your logs look like today. But yesterday had different timings that suggested a different problem. More runs of the kind that failed would be useful, along with the corresponding falkon logs that Ioan listed in a mail in this thread. -- From benc at hawaga.org.uk Wed Mar 19 22:42:46 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 20 Mar 2008 03:42:46 +0000 (GMT) Subject: [Swift-devel] plan for 0.5 release Message-ID: There was a long, long pause between swift 0.3 and swift 0.4, and consequently a bunch of bugs have been discovered. So I'd like to put out a 0.5 sometime in the next couple of weeks to release those bugfixes.
After that, hopefully I will manage to not wait so many months before releasing 0.6. -- From hategan at mcs.anl.gov Thu Mar 20 17:07:23 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 20 Mar 2008 17:07:23 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> Message-ID: <1206050843.4091.9.camel@blabla.mcs.anl.gov> On Wed, 2008-03-19 at 21:22 +0000, Ben Clifford wrote: > On Wed, 19 Mar 2008, Michael Wilde wrote: > > > My (likely outdated) understanding of the NFS protocol was that it's supposed to > > guarantee close-to-open coherence. Meaning that if two clients want to access > > a file sequentially, and the writing client closes the file before the reading > > client opens the file, then NFS was supposed to ensure that the reader > > correctly saw the existence and content of the file. > > Right. > > Linux NFS (but this is going back half a decade) had some problem there (I > think that caused problems for GRAM2 somewhere, for example) though I do > not remember the details; and it was also half a decade ago so has a good > chance of being different now. I seem to remember what looked like an oddity at the time, that the GRAM PBS script was writing a file on the worker node and insisted that the script (and the job) be "done" only when the file was visible on the head node. > > A quick Google did not find anything that immediately applied. > > I've also still not entirely ruled out a race somewhere in the > falkon->provider-deef->swift stack reporting this. > > > If others agree that this should still be the case, then it's worth > > looking at our code to make sure that this is the case.
If it wasn't, > > you'd think that more things would break, but perhaps Falkon exacerbates > > any problems in that area due to its low latency. > > Indeed, the combination of falkon and local filesystem access is probably > getting the time between touching the status file on one node and reading > it on another down pretty low compared to other submission and file access > protocols. > > > The race as far as I know is between the worker writing and moving result, > > info, and success status files, and the swift host seeing these, correct? > > That's what your logs look like today. But yesterday had different timings > that suggested a different problem. > > More runs of the kind that failed would be useful, along with the > corresponding falkon logs that Ioan listed in a mail in this thread. > From iraicu at cs.uchicago.edu Thu Mar 20 17:44:18 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 20 Mar 2008 17:44:18 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1206050843.4091.9.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <1206050843.4091.9.camel@blabla.mcs.anl.gov> Message-ID: <47E2E8C2.2050700@cs.uchicago.edu> If GRAM handles the staging in and out of data, then it's true. Falkon in the way that Swift is using it now does not do any data staging, so I don't see how Falkon can do any further checking on the existence of files, on behalf of jobs. What file would it check for? This would surely involve modifying the API in the falkon provider code, for Swift to tell Falkon what file it needs to verify. If Falkon were to handle the data management, then you are right, Falkon would do all this checking, but currently it just treats Swift jobs as black boxes, and knows nothing about files or directories that need to exist.
Furthermore, the Falkon service could run anywhere (given that firewalls and NATs permit), which further complicates any kind of checking for files on some remote file system. Why could Swift not have a retry mechanism, given that it received a successful exit code, be more persistent in looking for the success or failure file, and if it doesn't exist, to try it again after some small amount of sleep... this would certainly hide (and potentially solve) the race condition, with a persistent enough retry mechanism, wouldn't it? Ioan Mihael Hategan wrote: > On Wed, 2008-03-19 at 21:22 +0000, Ben Clifford wrote: > >> On Wed, 19 Mar 2008, Michael Wilde wrote: >> >> >>> My (likely outdated) understanding of the NFS protocol was that it's supposed to >>> guarantee close-to-open coherence. Meaning that if two clients want to access >>> a file sequentially, and the writing client closes the file before the reading >>> client opens the file, then NFS was supposed to ensure that the reader >>> correctly saw the existence and content of the file. >>> >> Right. >> >> Linux NFS (but this is going back half a decade) had some problem there (I >> think that caused problems for GRAM2 somewhere, for example) though I do >> not remember the details; and it was also half a decade ago so has a good >> chance of being different now. >> > > I seem to remember what looked like an oddity at the time, that the GRAM > PBS script was writing a file on the worker node and insisted that the > script (and the job) be "done" only when the file was visible on the > head node. > > >> A quick Google did not find anything that immediately applied. >> >> I've also still not entirely ruled out a race somewhere in the >> falkon->provider-deef->swift stack reporting this. >> >> >>> If others agree that this should still be the case, then it's worth >>> looking at our code to make sure that this is the case.
If it wasn't, >>> you'd think that more things would break, but perhaps Falkon exacerbates >>> any problems in that area due to its low latency. >>> >> Indeed, the combination of falkon and local filesystem access is probably >> getting the time between touching the status file on one node and reading >> it on another down pretty low compared to other submission and file access >> protocols. >> >> >>> The race as far as I know is between the worker writing and moving result, >>> info, and success status files, and the swift host seeing these, correct? >>> >> That's what your logs look like today. But yesterday had different timings >> that suggested a different problem. >> >> More runs of the kind that failed would be useful, along with the >> corresponding falkon logs that Ioan listed in a mail in this thread. >> >> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== ===================================================
From benc at hawaga.org.uk Thu Mar 20 18:17:47 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 20 Mar 2008 23:17:47 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E2E8C2.2050700@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <1206050843.4091.9.camel@blabla.mcs.anl.gov> <47E2E8C2.2050700@cs.uchicago.edu> Message-ID: On Thu, 20 Mar 2008, Ioan Raicu wrote: > Why could Swift not have a retry mechanism, given that it received a > successful exit code, be more persistent in looking for the success or failure > file, and if it doesn't exist, to try it again after some small amount of > sleep... this would certainly hide (and potentially solve) the race > condition, with a persistent enough retry mechanism, wouldn't it? The goal is not just to find a status file; there is other stuff being written to the shared filesystem and it's not clear that the status files appearing would guarantee that the other files had appeared too. -- From benc at hawaga.org.uk Thu Mar 20 18:23:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 20 Mar 2008 23:23:06 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> Message-ID: There is a flag for NFS mounts, 'noac', which disables attribute caching on clients, which I think may make the filesystem behave in the desired fashion; however it sounds like it also massively reduces filesystem performance and increases fileserver load. Mike, you might be able to persuade MCS systems to make such a filesystem available.
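For reference, a mount using 'noac' might look like the following; the server name, export path, and the remaining mount options are illustrative assumptions, not taken from this thread:

```shell
# Illustrative only: a hypothetical NFS mount with attribute caching
# disabled.  'noac' forces the client to revalidate file attributes on
# every access, which is why it costs performance and fileserver load.

# /etc/fstab entry:
#   fileserver:/export/home  /home  nfs  rw,hard,intr,noac  0  0

# Or mounted by hand (as root):
#   mount -t nfs -o rw,noac fileserver:/export/home /home
```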
I suspect some multi-second delay after touching the status file and before exiting in the wrapper script is probably the best workaround for now, though. -- From iraicu at cs.uchicago.edu Thu Mar 20 18:26:39 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 20 Mar 2008 18:26:39 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <1206050843.4091.9.camel@blabla.mcs.anl.gov> <47E2E8C2.2050700@cs.uchicago.edu> Message-ID: <47E2F2AF.9090601@cs.uchicago.edu> But the status file is written last, all from the same node, so in theory (would have to be tested, or at least verified by someone who knows NFS better than I do), if the status file appears, then the other files would also be there. A year ago, there was no status file... this was added later. What was the main motivator for adding the status file? Was it that you couldn't rely on the provider's exit codes? Or something else? Ioan Ben Clifford wrote: > On Thu, 20 Mar 2008, Ioan Raicu wrote: > > >> Why could Swift not have a retry mechanism, given that it received a >> successful exit code, be more persistent in looking for the success or failure >> file, and if it doesn't exist, to try it again after some small amount of >> sleep... this would certainly hide (and potentially solve) the race >> condition, with a persistent enough retry mechanism, wouldn't it? >> > > The goal is not just to find a status file; there is other stuff being > written to the shared filesystem and it's not clear that the status files > appearing would guarantee that the other files had appeared too. > > -- =================================================== Ioan Raicu Ph.D.
Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Thu Mar 20 18:29:13 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Thu, 20 Mar 2008 23:29:13 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E2F2AF.9090601@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <1206050843.4091.9.camel@blabla.mcs.anl.gov> <47E2E8C2.2050700@cs.uchicago.edu> <47E2F2AF.9090601@cs.uchicago.edu> Message-ID: If there is no status file and we rely on falkon reporting success, then we go to retrieve the last data file that was written out by the job, and 'oh! filesystem race condition, it isn't there...' for the same reasons that the status file isn't there now. On Thu, 20 Mar 2008, Ioan Raicu wrote: > But the status file is written last, all from the same node, so in theory > (would have to be tested, or at least verified by someone who knows NFS better > than I do), if the status file appears, then the other files would also be > there. A year ago, there was no status file... this was added later. What > was the main motivator for adding the status file? Was it that you couldn't > rely on the provider's exit codes? Or something else?
> > Ioan > > Ben Clifford wrote: > > On Thu, 20 Mar 2008, Ioan Raicu wrote: > > > > > Why could Swift not have a retry mechanism, given that it received a > > > successful exit code, be more persistent in looking for the success or > > > failure > > > file, and if it doesn't exist, to try it again after some small amount of > > > sleep... this would certainly hide (and potentially solve) the race > > > condition, with a persistent enough retry mechanism, wouldn't it? > > > > > > > The goal is not just to find a status file; there is other stuff being > > written to the shared filesystem and it's not clear that the status files > > appearing would guarantee that the other files had appeared too. > > > > > > From iraicu at cs.uchicago.edu Thu Mar 20 18:44:25 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 20 Mar 2008 18:44:25 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> Message-ID: <47E2F6D9.2030007@cs.uchicago.edu> I added a configurable delay in delivering notifications to Swift in the provider code, which Mike still has to test. The deef provider already had a queue for the incoming notifications, so it was not hard to delay these notifications from this queue to Swift. Another approach, which I discussed with Mike, was to do a sync at the end of the wrapper script. From my simple test on a Linux box, it seems that sync is a blocking call, which is exactly what we want!

iraicu at gto:~> time sync
real 0m0.711s
user 0m0.000s
sys 0m0.004s
iraicu at gto:~> time sync
real 0m0.035s
user 0m0.000s
sys 0m0.000s

Mike, could you try adding a sync at the end of the wrapper.sh (and make sure not to have any additional sleeps anywhere else), and see if that helps?
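The tail of such a wrapper could look like the sketch below. The directory layout and file names are hypothetical, not Swift's actual wrapper.sh; the point is only the ordering: output files first, status file last, then sync before the wrapper exits and the completion notification can fire.

```shell
# Hypothetical wrapper tail; layout and file names are illustrative.
jobdir=$(mktemp -d)
mkdir -p "$jobdir/status" "$jobdir/info"

: > "$jobdir/outdata"              # application output file(s)
: > "$jobdir/info/job.info"        # per-job info log
: > "$jobdir/status/job-success"   # status file, written last

# Flush queued writes before exiting, so the submit side does not see
# a success notification ahead of the files themselves.
sync

echo "wrapper done"
rm -r "$jobdir"
```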
Ioan Ben Clifford wrote: > There is a flag for NFS mounts, 'noac', which disables attribute caching on > clients, which I think may make the filesystem behave in the desired > fashion; however it sounds like it also massively reduces filesystem > performance and increases fileserver load. > > Mike, you might be able to persuade MCS systems to make such a filesystem > available. > > I suspect some multi-second delay after touching the status file and > before exiting in the wrapper script is probably the best workaround for > now, though. > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Thu Mar 20 19:03:24 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Thu, 20 Mar 2008 19:03:24 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E2F6D9.2030007@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E2F6D9.2030007@cs.uchicago.edu> Message-ID: <47E2FB4C.50406@cs.uchicago.edu> Here is more on sync: > According to the standard specification (e.g., POSIX.1-2001), sync() > schedules the writes, but may return before the > actual writing is done. However, since version 1.3.20 Linux > does actually wait. (This still does not guarantee data > integrity: modern disks have large caches.)
So, it looks like it might be blocking, but it might depend on the Linux kernel. Anyway, I think it's worth a try, and it seems like a better solution than sleeps. Ioan Ioan Raicu wrote: > I added a configurable delay in delivering notifications to Swift in > the provider code, which Mike still has to test. The deef provider > already had a queue for the incoming notifications, so it was not hard > to delay these notifications from this queue to Swift. > > Another approach, which I discussed with Mike, was to do a sync at the > end of the wrapper script. From my simple test on a Linux box, it > seems that sync is a blocking call, which is exactly what we want! > iraicu at gto:~> time sync > real 0m0.711s > user 0m0.000s > sys 0m0.004s > iraicu at gto:~> time sync > real 0m0.035s > user 0m0.000s > sys 0m0.000s > > Mike, could you try adding a sync at the end of the wrapper.sh (and > make sure not to have any additional sleeps anywhere else), and see if > that helps? > > Ioan > > > Ben Clifford wrote: >> There is a flag for NFS mounts, 'noac', which disables attribute >> caching on clients, which I think may make the filesystem behave in >> the desired fashion; however it sounds like it also massively reduces >> filesystem performance and increases fileserver load. >> >> Mike, you might be able to persuade MCS systems to make such a >> filesystem available. >> >> I suspect some multi-second delay after touching the status file and >> before exiting in the wrapper script is probably the best workaround >> for now, though. >> >> > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E.
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From benc at hawaga.org.uk Thu Mar 20 19:14:42 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 21 Mar 2008 00:14:42 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E2F6D9.2030007@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E2F6D9.2030007@cs.uchicago.edu> Message-ID: On Thu, 20 Mar 2008, Ioan Raicu wrote: > Mike, could you try adding a sync at the end of the wrapper.sh (and make sure > not to have any additional sleeps anywhere else), and see if that helps? Yeah, that's a good thing to try. I don't really know how sync works wrt NFS, but it's worth trying. -- From hategan at mcs.anl.gov Fri Mar 21 03:18:16 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 21 Mar 2008 03:18:16 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E2E8C2.2050700@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <1206050843.4091.9.camel@blabla.mcs.anl.gov> <47E2E8C2.2050700@cs.uchicago.edu> Message-ID: <1206087497.4572.3.camel@blabla.mcs.anl.gov> On Thu, 2008-03-20 at 17:44 -0500, Ioan Raicu wrote: > If GRAM handles the staging in and out of data, then it's true.
> Falkon in the way that Swift is using it now does not do any data > staging, so I don't see how Falkon can do any further checking on the > existence of files, on behalf of jobs. What file would it check for? Pretty much the file that GRAM checks for: one that it creates after the executable completes. If the filesystem preserves temporal ordering on file availability, then this will guarantee that any files created by the job will be visible. > This would surely involve modifying the API in the falkon provider > code, for Swift to tell Falkon what file it needs to verify. > > If Falkon were to handle the data management, then you are right, > Falkon would do all this checking, but currently it just treats Swift > jobs as black boxes, and knows nothing about files or directories that > need to exist. Furthermore, the Falkon service could run anywhere > (given that firewalls and NATs permit), which further complicates any > kind of checking for files on some remote file system. > > Why could Swift not have a retry mechanism, given that it received a > successful exit code, be more persistent in looking for the success or > failure file, and if it doesn't exist, to try it again after some > small amount of sleep... this would certainly hide (and potentially > solve) the race condition, with a persisitent enough retry mechanism, > wouldn't it? > > Ioan > > Mihael Hategan wrote: > > On Wed, 2008-03-19 at 21:22 +0000, Ben Clifford wrote: > > > > > On Wed, 19 Mar 2008, Michael Wilde wrote: > > > > > > > > > > My (likely outdated) understanding of NFS protocol was that its supposed to > > > > guarantee close-to-open coherence. Meaning that if two clients want to access > > > > a file sequentially, and the writing client closes the file before the reading > > > > client opens the file, then NFS was supposed to ensure that the reader > > > > correctly saw the existence and content of the file. > > > > > > > Right. 
> > > > > > Linux NFS (but this is going back half a decade) had some problem there (I > > > think that caused problems for GRAM2 somewhere, for example) though I do > > > not remember the details; and it was also half a decade ago so has a good > > > chance of being different now. > > > > > > I seem to remember what looked like an oddity at the time, that the GRAM > > PBS script was writing a file on the worker node and insisted that the > > script (and the job) be "done" only when the file was visible on the > > head node. > > > > > > > A quick Google did not find anything that immediately applied. > > > > > > I've also still not entirely ruled out a race somewhere in the > > > falkon->provider-deef->swift stack reporting this. > > > > > > > > > > If others agree that this should still be the case, then it's worth > > > > looking at our code to make sure that this is the case. If it wasn't, > > > > you'd think that more things would break, but perhaps Falkon exacerbates > > > > any problems in that area due to its low latency. > > > > > > > Indeed, the combination of falkon and local filesystem access is probably > > > getting the time between touching the status file on one node and reading > > > it on another down pretty low compared to other submission and file access > > > protocols. > > > > > > > > > > The race as far as I know is between the worker writing and moving result, > > > > info, and success status files, and the swift host seeing these, correct? > > > > > > > That's what your logs look like today. But yesterday had different timings > > > that suggested a different problem. > > > > > > More runs of the kind that failed would be useful, along with the > > > corresponding falkon logs that Ioan listed in a mail in this thread.
> > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > From hategan at mcs.anl.gov Fri Mar 21 03:21:25 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 21 Mar 2008 03:21:25 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E2F2AF.9090601@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <1206050843.4091.9.camel@blabla.mcs.anl.gov> <47E2E8C2.2050700@cs.uchicago.edu> <47E2F2AF.9090601@cs.uchicago.edu> Message-ID: <1206087685.4572.7.camel@blabla.mcs.anl.gov> On Thu, 2008-03-20 at 18:26 -0500, Ioan Raicu wrote: > But the status file is written last, all from the same node, so in > theory (would have to be tested, or at least verified by someone who > knows NFS better than I do), if the status file appears, then the > other files would also be there. A year ago, there was no status > file... this was added later. Your assumption is incorrect. 
There was an exit code file written when the application failed, but nothing written when the application succeeded, causing ambiguity when the filesystem settings were wrong. > What was the main motivator for adding the status file? Was it that > you couldn't rely on the provider's exit codes? Or something else? > > Ioan > > Ben Clifford wrote: > > On Thu, 20 Mar 2008, Ioan Raicu wrote: > > > > > > > Why could Swift not have a retry mechanism, given that it received a > > > successful exit code, be more persistent in looking for the success or failure > > > file, and if it doesn't exist, to try it again after some small amount of > > > sleep... this would certainly hide (and potentially solve) the race > > > condition, with a persistent enough retry mechanism, wouldn't it? > > > > > > > The goal is not just to find a status file; there is other stuff being > > written to the shared filesystem and it's not clear that the status files > > appearing would guarantee that the other files had appeared too. > > > > > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E.
58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > From wilde at mcs.anl.gov Fri Mar 21 07:12:03 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 21 Mar 2008 07:12:03 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> Message-ID: <47E3A613.2040701@mcs.anl.gov> My latest test on runs of 25, 100, and 1000 jobs seems to indicate that with a sync command at the end of the application script, all job status and data is returned ok every time. (This is somewhat curious, as the info and success files for the current job would not yet be complete at the time, but the sync command affects all other activity on the host, and ensures that at least the currently existing dirs, files and data are synced, or that their sync has started). Without the sync, at the moment, virtually all jobs fail, and almost *no* data is being returned. Out of 3 runs of 1000 jobs, one run returned 2 data files, the other two returned no data files. One 100-job run without sync returned 11 of 100 files. It seems like the most fruitful testing to see if this sync is totally fixing the problem is to do lots more runs. I noted that the bblogin host (from which I run Swift) has no special NFS mount flags, just rw. (I was wondering if they had something on that would affect coherence; seems not).
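If the sync workaround turns out not to be sufficient, the retry mechanism Ioan suggested earlier in the thread amounts to a bounded poll for the status file after the success notification arrives. A sketch, with a hypothetical function name, retry count, and delay (none of this is Swift's actual code):

```shell
# Hypothetical bounded poll for a status file; limits are illustrative.
wait_for_status() {
    _file=$1
    _tries=${2:-10}   # how many times to look before giving up
    _delay=${3:-1}    # seconds to sleep between looks
    while [ "$_tries" -gt 0 ]; do
        [ -e "$_file" ] && return 0   # visible: treat the job as done
        sleep "$_delay"
        _tries=$((_tries - 1))
    done
    return 1                          # still missing: report failure
}

# Example: a file that already exists is found on the first look.
tmp=$(mktemp -d)
: > "$tmp/job-success"
wait_for_status "$tmp/job-success" 5 1 && echo found
rm -r "$tmp"
```

As Ben points out, though, the status file appearing does not by itself guarantee that the job's other output files are visible, so this hides rather than removes the race.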
I did not have a chance to capture the falkon logs in these tests; I will look for the ones Ioan mentioned, and try some runs with those logs captured. The swift logs I did capture are in the CI log dir, wilde/run{317-328}

run317/comment:amps1 100 sico with sync - ran ok
run318/comment:amps1 100 with no sync - died on first error
run319/comment:amps1 without sync - 11 of 100 returned OK
run320/comment:amps1 100 without sync - no data returned ok
run321/comment:amps1 100 without sync - no data returned ok
run322/comment:amps1 100 with sync - all data returned ok
run323/comment:amps1 100 with sync - all data returned ok
run324/comment:amps1 1000 with sync - all data returned ok
run325/comment:amps1 1000 without sync - no data returned ok
run326/comment:amps1 25 without sync - no data returned ok
run327/comment:amps1 100 without sync - 2 data files returned ok
run328/comment:amps1 1000 with sync - all data returned ok

- Mike On 3/20/08 6:23 PM, Ben Clifford wrote: > There is a flag for NFS mounts, 'noac', which disables attribute caching on > clients, which I think may make the filesystem behave in the desired > fashion; however it sounds like it also massively reduces filesystem > performance and fileserver load. > > Mike, you might be able to persuade MCS systems to make such a filesystem > available. > > I suspect some multi-second delay after touching the status file and > before exiting in the wrapper script is probably the best workaround for > now, though.
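Ben's delay-before-exit workaround can be sketched as follows; STATUS_DIR and ID are hypothetical stand-ins for wrapper.sh's real status/${JOBDIR} directory and job id, and the 2-second figure is arbitrary, not a tuned value:

```shell
# Sketch of the suggested workaround: touch the status file, then
# pause a few seconds before exiting so the NFS client has a chance
# to flush it before Swift goes looking for the file.
STATUS_DIR=$(mktemp -d)   # stand-in for status/${JOBDIR}
ID=job0001                # stand-in for the wrapper's job id
touch "$STATUS_DIR/$ID-success"
sleep 2                   # multi-second delay before exit
echo "wrote $STATUS_DIR/$ID-success"
```

This hides, rather than fixes, the coherence race: it just makes it more likely the client has flushed before the job slot is reused.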
> From wilde at mcs.anl.gov Fri Mar 21 08:34:43 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 21 Mar 2008 08:34:43 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E3A613.2040701@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> Message-ID: <47E3B973.6040503@mcs.anl.gov> Runs 329 and 330 (both in the CI log dir) were run with (I hope) the requested Falkon logs turned on. Note that I turned the deef provider logs on too, but did not yet verify that it was correctly logging. run329 was 9 jobs, no sync. All 9 succeeded. run330 was 25 jobs, no sync. 19 of 25 succeeded, the rest failed. This is starting to confirm a curious pattern: without the sync, workflows with more jobs achieve *fewer* total successful jobs. Here's what I recall from the last few days of testing:

1 job wf: all succeeds
9 job wf: all succeed
25 job wf: 15-20 succeed
100 job wf: 1-2 succeed
1000 job wf: 0 succeed

I don't have enough data to confirm this, but the pattern seems to be present. I am going to set the problem aside for now, until, Ben and Ioan, you have a chance to look at the logs from this morning's test. I'll assume for the moment that the sync "fixes" the problem, and go on to the application tests I need to run, keeping an eye out for anomalies. My goal is to do large-scale tests of AMIGA and DOCK under Swift, reducing wrapper.sh and throttling delays, and doing as much work on local RAM filesystems as possible. Mike On 3/21/08 7:12 AM, Michael Wilde wrote: > My latest tests on runs of 25, 100, and 1000 jobs seem to indicate that > with a sync command at the end of the application script, all job status > and data is returned ok every time.
> > (This is somewhat curious, as the info and success files for the current job > would not yet be complete at the time, but the sync command affects all > other activity on the host, and ensures that at least the currently > existing dirs, files and data are synced, or that their sync has started). > > Without the sync, at the moment, virtually all jobs fail, and almost > *no* data is being returned. Out of 3 runs of 1000 jobs, one run > returned 2 data files, the other two returned no data files. One 100-job > run without sync returned 11 of 100 files. > > It seems like the most fruitful testing to see if this sync is totally > fixing the problem is to do lots more runs. > > I noted that the bblog host (from which I run Swift) has no special NFS > mount flags, just rw. (I was wondering if they had something on that > would affect coherence; seems not). > > I did not have a chance to capture the falkon logs in these tests; I > will look for the ones Ioan mentioned, and try some runs with those logs > captured.
> > The swift logs I did capture are in the CI log dir, wilde/run{317-328}
> 
> run317/comment:amps1 100 sico with sync - ran ok
> run318/comment:amps1 100 with no sync - died on first error
> run319/comment:amps1 without sync - 11 of 100 returned OK
> run320/comment:amps1 100 without sync - no data returned ok
> run321/comment:amps1 100 without sync - no data returned ok
> run322/comment:amps1 100 with sync - all data returned ok
> run323/comment:amps1 100 with sync - all data returned ok
> run324/comment:amps1 1000 with sync - all data returned ok
> run325/comment:amps1 1000 without sync - no data returned ok
> run326/comment:amps1 25 without sync - no data returned ok
> run327/comment:amps1 100 without sync - 2 data files returned ok
> run328/comment:amps1 1000 with sync - all data returned ok
> 
> - Mike > > On 3/20/08 6:23 PM, Ben Clifford wrote: >> There is a flag for NFS mounts, 'noac', which disables attribute caching >> on clients, which I think may make the filesystem behave in the >> desired fashion; however it sounds like it also massively reduces >> filesystem performance and fileserver load. >> >> Mike, you might be able to persuade MCS systems to make such a >> filesystem available. >> >> I suspect some multi-second delay after touching the status file and >> before exiting in the wrapper script is probably the best workaround >> for now, though.
>> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > From hategan at mcs.anl.gov Fri Mar 21 09:02:30 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 21 Mar 2008 09:02:30 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E3A613.2040701@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> Message-ID: <1206108150.5100.0.camel@blabla.mcs.anl.gov> On Fri, 2008-03-21 at 07:12 -0500, Michael Wilde wrote: > My latest tests on runs of 25, 100, and 1000 jobs seem to indicate that > with a sync command at the end of the application script, all job status > and data is returned ok every time. Why not put it in the wrapper script at the end? > > (This is somewhat curious, as the info and success files for the current job > would not yet be complete at the time, but the sync command affects all > other activity on the host, and ensures that at least the currently > existing dirs, files and data are synced, or that their sync has started). > > Without the sync, at the moment, virtually all jobs fail, and almost > *no* data is being returned. Out of 3 runs of 1000 jobs, one run > returned 2 data files, the other two returned no data files. One 100-job > run without sync returned 11 of 100 files. > > It seems like the most fruitful testing to see if this sync is totally > fixing the problem is to do lots more runs. > > I noted that the bblog host (from which I run Swift) has no special NFS > mount flags, just rw. (I was wondering if they had something on that > would affect coherence; seems not).
> > I did not have a chance to capture the falkon logs in these tests; I > will look for the ones Ioan mentioned, and try some runs with those logs > captured. > > The swift logs I did capture are in the CI log dir, wilde/run{317-328}
> 
> run317/comment:amps1 100 sico with sync - ran ok
> run318/comment:amps1 100 with no sync - died on first error
> run319/comment:amps1 without sync - 11 of 100 returned OK
> run320/comment:amps1 100 without sync - no data returned ok
> run321/comment:amps1 100 without sync - no data returned ok
> run322/comment:amps1 100 with sync - all data returned ok
> run323/comment:amps1 100 with sync - all data returned ok
> run324/comment:amps1 1000 with sync - all data returned ok
> run325/comment:amps1 1000 without sync - no data returned ok
> run326/comment:amps1 25 without sync - no data returned ok
> run327/comment:amps1 100 without sync - 2 data files returned ok
> run328/comment:amps1 1000 with sync - all data returned ok
> 
> - Mike > > On 3/20/08 6:23 PM, Ben Clifford wrote: > > There is a flag for NFS mounts, 'noac', which disables attribute caching on > > clients, which I think may make the filesystem behave in the desired > > fashion; however it sounds like it also massively reduces filesystem > > performance and fileserver load. > > > > Mike, you might be able to persuade MCS systems to make such a filesystem > > available. > > > > I suspect some multi-second delay after touching the status file and > > before exiting in the wrapper script is probably the best workaround for > > now, though.
> > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From iraicu at cs.uchicago.edu Fri Mar 21 09:38:15 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Fri, 21 Mar 2008 09:38:15 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1206108150.5100.0.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> Message-ID: <47E3C857.604@cs.uchicago.edu> I would also think that having the sync at the end of the wrapper.sh after it is done modifying any other files would be the best thing. Ioan Mihael Hategan wrote: > On Fri, 2008-03-21 at 07:12 -0500, Michael Wilde wrote: > >> My latest test on runs of 25, 100, and 1000 jobs seem to indicate that >> with a sync command at the end of the application script, all job status >> and data is returned ok every time. >> > > Why not put it in the wrapper script at the end? > > > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From benc at hawaga.org.uk Fri Mar 21 09:50:22 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 21 Mar 2008 14:50:22 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E3B973.6040503@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <47E3B973.6040503@mcs.anl.gov> Message-ID: > This is starting to confirm a curious pattern: without the sync, workflows > with more jobs achieve *fewer* total successful jobs. I can imagine that happening - with more stuff going on, flush to NFS server happens less/slower, because other stuff is happening instead. -- From benc at hawaga.org.uk Sun Mar 23 19:12:05 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 00:12:05 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <1206108150.5100.0.camel@blabla.mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> Message-ID: On Fri, 21 Mar 2008, Mihael Hategan wrote: > On Fri, 2008-03-21 at 07:12 -0500, Michael Wilde wrote: > > My latest tests on runs of 25, 100, and 1000 jobs seem to indicate that > > with a sync command at the end of the application script, all job status > > and data is returned ok every time. > > Why not put it in the wrapper script at the end? Mike, the attached patch will do that, and will also add logging information so that we can see how long syncs are taking compared to other stages in worker node execution.
cd cog/modules/vdsk
patch -p1 < sync-in-wrapper

--
-------------- next part --------------
Index: swift/libexec/wrapper.sh
===================================================================
--- swift.orig/libexec/wrapper.sh	2008-03-24 08:49:43.000000000 +0900
+++ swift/libexec/wrapper.sh	2008-03-24 08:49:45.000000000 +0900
@@ -240,5 +240,9 @@
 logstate "TOUCH_SUCCESS"
 touch status/${JOBDIR}/${ID}-success
+
+logstate SYNC
+sync
+
 logstate "END"

From wilde at mcs.anl.gov Sun Mar 23 21:59:04 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 23 Mar 2008 21:59:04 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> Message-ID: <47E718F8.6010800@mcs.anl.gov> Ben, thanks. I've been debugging on this since Friday. I had already moved the sync into wrapper.sh when Mihael first mentioned it. Friday afternoon I moved from a falkon binary drop that Ioan had built for me, to a build that I got from SVN and built myself. When I did that the nature of the problem changed:

- first run after a falkon restart, with the sync in wrapper.sh, worked fine, at various workflow sizes.
- second run would consistently fail with most jobs missing output, status and info files. Turns out the data was going mostly into the previous workflow's workdir.

After much debugging, the problem was found to be bad message formatting in the falkon service, causing the chdir to the workdir to fail. It failed very seldom on the initial workflow, and heavily on subsequent ones. This problem, too, initially looked like NFS incoherence.
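The chdir failure described above silently ran jobs in the previous workflow's workdir; a common defensive idiom is to make the chdir fatal so the job aborts instead of writing into the wrong directory. A minimal sketch (WORKDIR is a made-up stand-in for the per-job directory, not Falkon's actual variable):

```shell
# If the cd fails, abort immediately rather than running the job in
# whatever directory we happen to be in (e.g. the previous run's workdir).
WORKDIR=$(mktemp -d)   # stand-in for the per-job work directory
cd "$WORKDIR" || { echo "cannot chdir to $WORKDIR" >&2; exit 1; }
echo "running in: $(pwd)"

# A bogus directory now produces a detectable failure instead of
# silently continuing in the wrong place:
( cd /no/such/workdir 2>/dev/null ) || echo "chdir failure detected"
```

A failure mode that aborts loudly is much easier to distinguish from NFS incoherence than one that scatters output into a stale directory.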
Since that was fixed, I've been experimenting with workflows of various sizes, and have run several 10, 25, 100, 500, and 1000 job workflows, all without any sync, and without apparent problems. Some mysteries remain, as it's not clear that this message/chdir fix explains the earlier problem. But several Falkon fixes went in as well, so there are too many variables to know with confidence whether the original problem remains. Ioan: I do see that we're losing some workers, so some investigation is needed on the Falkon side. Ben: the swift provenance log records seem excessive: I'll start a thread on that. I'm now going to start performance measurement and tuning on this now that things seem stable enough to do repeatable runs. - Mike On 3/23/08 7:12 PM, Ben Clifford wrote: > On Fri, 21 Mar 2008, Mihael Hategan wrote: > >> On Fri, 2008-03-21 at 07:12 -0500, Michael Wilde wrote: >>> My latest tests on runs of 25, 100, and 1000 jobs seem to indicate that >>> with a sync command at the end of the application script, all job status >>> and data is returned ok every time. >> Why not put it in the wrapper script at the end? > > Mike, the attached patch will do that, and will also add logging > information so that we can see how long syncs are taking compared to other > stages in worker node execution.
> cd cog/modules/vdsk > patch -p1 < sync-in-wrapper > > From benc at hawaga.org.uk Sun Mar 23 22:03:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 03:03:04 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E718F8.6010800@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> Message-ID: On Sun, 23 Mar 2008, Michael Wilde wrote: > Ben: the swift provenance log records seem excessive

which ones?

-- From wilde at mcs.anl.gov Sun Mar 23 22:15:10 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sun, 23 Mar 2008 22:15:10 -0500 Subject: [Swift-devel] Excessive object-closing messages in log Message-ID: <47E71CBE.20000@mcs.anl.gov> Ben, for a small swift script that iterates over a parameter array, I seem to be getting about (N^2)/2 log records regarding object closing. The messages are of the form:

2008-03-23 18:30:36,877-0600 INFO CloseDataset org.griphyn.vdl.mapping.RootDataNode identifier tag:benc at ci.uchicago.edu,2008:swift:dataset:20080323-1828-x14dlldb:720000003246 with no value at dataset=ofile (closed)

For 500 entries, I had about 130K object closing log records; for 1000 entries, over 500K:

bb$ grep -i -c closedataset run{341,342}/amps*.log
run341/amps1-20080323-1828-25xcue89.log:130259
run342/amps1-20080323-1935-su38n0k5.log:510509

341 is 500 jobs, 342 is 1000 jobs. Is this log mechanism supposed to do that? If so, is that practical, as we want to test this script in the range of 1M jobs. run342 is in swift-logs/wilde The script is below.
My properties are:

sitedir.keep=true
lazy.errors=true
execution.retries=0
#kickstart.always.transfer=true
throttle.submit=off
throttle.host.submit=off
throttle.transfers=20
throttle.file.operations=20
throttle.score.job.factor=1000000
sitedir.keep=true

- Mike

type amout;

(amout ofile) runam3 (string id, string dieselLowSLightLLProd, string dieselMedSLightLLProd) {
    app {
        runam3 id dieselLowSLightLLProd dieselMedSLightLLProd;
    }
}

type params {
    string id;
    string dieselLowSLightLLProd;
    string dieselMedSLightLLProd;
};

doall(params p[]) {
    foreach pset in p {
        amout ofile;
        ofile = runam3(pset.id, pset.dieselLowSLightLLProd, pset.dieselMedSLightLLProd);
    }
}

// Main
params p[];
p = readdata("paramlist");
doall(p);

From benc at hawaga.org.uk Mon Mar 24 03:26:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 08:26:40 +0000 (GMT) Subject: [Swift-devel] Re: Excessive object-closing messages in log In-Reply-To: <47E71CBE.20000@mcs.anl.gov> References: <47E71CBE.20000@mcs.anl.gov> Message-ID: On Sun, 23 Mar 2008, Michael Wilde wrote: > Ben, for a small swift script that iterates over a parameter array, I seem to > be getting about (N^2)/2 log records regarding object closing. For the script you gave, it should be more like O(N). I'll have a poke around and see what's going on. Also, in your script is there a reason you have a separate doall function rather than putting everything in the top level? This used to be a solution to a closing problem but I think that should have been fixed now. Also, > throttle.score.job.factor=1000000 That is documented as taking 'off', which is probably the effect you are trying to achieve with that value. Does that cause a problem for you?
-- From benc at hawaga.org.uk Mon Mar 24 04:02:51 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 09:02:51 +0000 (GMT) Subject: [Swift-devel] Re: Excessive object-closing messages in log In-Reply-To: <47E71CBE.20000@mcs.anl.gov> References: <47E71CBE.20000@mcs.anl.gov> Message-ID: On Sun, 23 Mar 2008, Michael Wilde wrote: > Ben, for a small swift script that iterates over a parameter array, I seem to > be getting about (N^2)/2 log records regarding object closing. There was a debugging log loop that dumped the entire contents of the dataset closing-tracking cache every iteration of the loop (with that cache growing by one each iteration). This dump is gone as of r1759. -- From benc at hawaga.org.uk Mon Mar 24 04:28:00 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 09:28:00 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E718F8.6010800@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> Message-ID: On Sun, 23 Mar 2008, Michael Wilde wrote: > I'm now going to start performance measurement and tuning on this now that > things seem stable enough to do repeatable runs. For worker side performance measurement, set wrapperlog.always.transfer=true and copy the resulting *.d directory (which should have the same basename as the logfile) into the log repo. That will give a breakdown of what the worker is doing during the time that falkon says it is executing (which ideally will be almost entirely executing the actual application). 
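The quadratic blow-up Ben removed in r1759 is easy to reproduce: dumping a cache that grows by one entry per iteration logs 1 + 2 + ... + N = N*(N+1)/2 records in total. A small sketch of the counting (N is an arbitrary example size, not from Mike's runs):

```shell
# Why a per-iteration dump of a growing cache gives ~(N^2)/2 records:
# iteration i logs i entries, so the total is 1+2+...+N = N*(N+1)/2.
N=100
size=0; total=0; i=1
while [ "$i" -le "$N" ]; do
  size=$((size + 1))        # cache grows by one dataset per iteration
  total=$((total + size))   # dumping it logs every cached entry again
  i=$((i + 1))
done
echo "records for N=$N: $total"   # N*(N+1)/2 = 5050 for N=100
```

At N=1000 this formula gives 500500 records, which matches the roughly 510K CloseDataset lines Mike counted.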
-- From benc at hawaga.org.uk Mon Mar 24 04:46:20 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 09:46:20 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> Message-ID: you can get plots for your 1000 job run here: http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ you're hitting the file transfer and file operation limits (that are 20 in your config) once jobs start staging out. There's a weird-looking plateau in graph 'number of execute2 tasks at once:' around 170s .. 200s where no jobs complete for some time. Getting the falkon logs and/or the wrapper (.d) logs would be interesting there. these were generated on my laptop with:

make \
  LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log clean \
  webpage.weights webpage.kara webpage

using the SVN log-processing code. -- From iraicu at cs.uchicago.edu Mon Mar 24 08:46:21 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 24 Mar 2008 08:46:21 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E718F8.6010800@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> Message-ID: <47E7B0AD.6000609@cs.uchicago.edu> Michael Wilde wrote: > > Ioan: I do see that we're losing some workers, so some investigation > is needed on the Falkon side.
> I see that some workers remain in a pending state for about a minute, after which the same number of workers that are in pending state register as new workers, and the number of available workers gets back up to what we had to begin with. I think I removed a few months back the mechanism that went through periodically and cleaned up the pending workers... which is leaving them stuck in a pending state in the logs. I'll try to add that back in. Now, for the real problem of why these workers are ending up in a pending state, and never going to a running state, we need to do more debugging. Ioan > > - Mike > > > On 3/23/08 7:12 PM, Ben Clifford wrote: >> On Fri, 21 Mar 2008, Mihael Hategan wrote: >> >>> On Fri, 2008-03-21 at 07:12 -0500, Michael Wilde wrote: >>>> My latest tests on runs of 25, 100, and 1000 jobs seem to indicate that >>>> with a sync command at the end of the application script, all job >>>> status >>>> and data is returned ok every time. >>> Why not put it in the wrapper script at the end? >> >> Mike, the attached patch will do that, and will also add logging >> information so that we can see how long syncs are taking compared to >> other stages in worker node execution. >> >> cd cog/modules/vdsk >> patch -p1 < sync-in-wrapper >> >> > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E.
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From iraicu at cs.uchicago.edu Mon Mar 24 08:54:05 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 24 Mar 2008 08:54:05 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DE676A.8090604@cs.uchicago.edu> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> Message-ID: <47E7B27D.6050606@cs.uchicago.edu> I see the plateau, but there are other graphs which seem to go crazy during those periods, such as http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png Looking at the Falkon logs might reveal more about whether the plateau was due to Falkon or not. Where would I find the Falkon logs that correlate to these graphs? Ioan Ben Clifford wrote: > you can get plots for your 1000 job run here: > > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ > > you're hitting the file transfer and file operation limits (that are 20 in > your config) once jobs start staging out. > > There's a weird-looking plateau in graph 'number of execute2 tasks at > once:' around 170s .. 200s where no jobs complete for some time. > > Getting the falkon logs and/or the wrapper (.d) logs would be interesting
> > these were generated on my laptop with: > > make \ > LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log clean \ > webpage.weights webpage.kara webpage > > using the SVN log-processing code. > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Mon Mar 24 08:59:22 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 08:59:22 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E7B27D.6050606@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> Message-ID: <47E7B3BA.9040502@mcs.anl.gov> On 3/24/08 8:54 AM, Ioan Raicu wrote: > I see the plateau, but there are other graphs which seem to go crazy > during those periods, such as > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png > > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png > > > Looking at the Falkon logs might reveal more about whether the plateau was > due to Falkon or not. Where would I find the Falkon logs that correlate > to these graphs?
\ bblogin.mcs.anl.gov:/home/wilde/falkon/logs > > Ioan > > Ben Clifford wrote: >> you can get plots for your 1000 job run here: >> >> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ >> >> you're hitting the file transfer and file operation limits (that are >> 20 in your config) once jobs start staging out. >> >> There's a weird-looking plateau in graph 'number of execute2 tasks at >> once:' around 170s .. 200s where no jobs complete for some time. >> >> Getting the falkon logs and/or the wrapper (.d) logs would be >> interesting there. >> >> these were generated on my laptop with: >> >> make \ >> LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log clean \ >> webpage.weights webpage.kara webpage >> >> using the SVN log-processing code. >> > From iraicu at cs.uchicago.edu Mon Mar 24 09:47:41 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 24 Mar 2008 09:47:41 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E7B3BA.9040502@mcs.anl.gov> References: <47DDBFD7.2050700@mcs.anl.gov> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> <47E7B3BA.9040502@mcs.anl.gov> Message-ID: <47E7BF0D.1010104@cs.uchicago.edu> Michael Wilde wrote: > > > On 3/24/08 8:54 AM, Ioan Raicu wrote: >> I see the plateau, but there are other graphs which seem to go crazy >> during those periods, such as >> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png >> >> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png >> >> >> Looking at the Falkon logs might reveal more about whether the plateau was >> due to Falkon or not. Where would I find the Falkon logs that >> correlate to these graphs?
> \ > > bblogin.mcs.anl.gov:/home/wilde/falkon/logs > But which ones? There are about 8 dirs (or something similar) there, and each dir contains multiple runs... how can I tell which log, and which parts of the logs, are the run that is graphed by Ben? >>> From wilde at mcs.anl.gov Mon Mar 24 10:05:16 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 10:05:16 -0500 Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E7BF0D.1010104@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> <47E7B3BA.9040502@mcs.anl.gov> <47E7BF0D.1010104@cs.uchicago.edu> Message-ID: <47E7C32C.8040002@mcs.anl.gov> On 3/24/08 9:47 AM, Ioan Raicu wrote: >>> >>> Looking at the Falkon logs might reveal more about whether the plateau was >>> due to Falkon or not. Where would I find the Falkon logs that >>> correlate to these graphs? >> \ >> >> bblogin.mcs.anl.gov:/home/wilde/falkon/logs >> > But which ones? There are about 8 dirs (or something similar) there, > and each dir contains multiple runs... how can I tell which log, and > which parts of the logs, are the run that is graphed by Ben? >>> > The pointer to that was in this email: On 3/24/08 7:55 AM, Michael Wilde wrote: > > The summary logs are on bblogin in ~wilde/falkon. > > I keep swift run logs in ~wilde/swift/logs/runNNN and copy some of them > to the CI NFS at ~benc/swift-logs/wilde/runNNN. > > I'll start copying the falkon logs to the same place, but for the > previous tests you'll need to locate them separately. > > In the swift log dirs, amps*.log shows the run time (look at first and > last line). Ben pointed out that I'm doing data transfer throttling > which is slowing things down.
I did that intentionally at this stage to > avoid hurting sico NFS. I'll start opening that throttle up. From iraicu at cs.uchicago.edu Mon Mar 24 11:48:16 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 24 Mar 2008 11:48:16 -0500 Subject: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus... In-Reply-To: <47E7B27D.6050606@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> Message-ID: <47E7DB50.2020407@cs.uchicago.edu> .OK, here is my analysis of the plateaus, from Falkon's point of view. Notice the per task execution (green) is about 100 seconds per job, where the job is some invocation of the wrapper.sh that Swift sent to Falkon. Things look normal so far. See the 2nd graph for more... This shows that there are 600 workers (600 CPUs), which all get their work within 10 seconds... then they all churn away until about 100 sec when jobs start completing, and new ones get dispatched. At around 132 seconds, the wait queue is empty, and some workers start becoming idle (the red area)... by time 155, the initial 600 jobs that started between time 0 and 10, have completed, and from 155 to 211, the remaining 400 jobs all run to completion; they really only start completing around 190 sec, and all finish by 211. So, the plateau, that is evident here as well, is really when 400 workers are executing 400 jobs in parallel, and since the jobs are taking around 100 sec each to complete, the plateau of 50 seconds is completely normal. See more after the graph... Now the real question is, what is the breakdown of the 100 sec invocation (108.645 sec on average to be exact), how much is due to wrapper.sh, and how much is due to the application itself? 
Mike, can you comment on this? I assume you are running amiga which should have 0.5 sec jobs, right? Ioan Ioan Raicu wrote: > I see the plateau, but there are other graphs which seem to go crazy > during those periods, such as > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png > > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png > > > Looking at the Falkon logs might reveal more about if the plateau was > due to Falkon or not. Where would I find the Falkon logs that > correlate to these graphs? > > Ioan > > Ben Clifford wrote: >> you can get plots for your 1000 job run here: >> >> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ >> >> you're hitting the file transfer and file operation limits (that are >> 20 in your config) once jobs start staging out. >> >> There's a wierd looking plateu in graph 'number of execute2 tasks at >> once:' around 170s .. 200s where no jobs complete for some time. >> >> Getting the falkon logs and/or the wrapper (.d) logs would be >> interesting there. >> >> these were generated on my laptop with: >> >> make \ >> LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log clean \ >> webpage.weights webpage.kara webpage >> >> using the SVN log-procesisng code. >> > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 38387 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 47424 bytes Desc: not available URL: From wilde at mcs.anl.gov Mon Mar 24 12:21:14 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 12:21:14 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E7DC3D.6040704@cs.uchicago.edu> References: <47E7DC3D.6040704@cs.uchicago.edu> Message-ID: <47E7E30A.2020208@mcs.anl.gov> > Now the real question is, what is the breakdown of the 100 sec > invocation (108.645 sec on average to be exact), how much is due to > wrapper.sh, and how much is due to the application itself? Mike, can > you comment on this? I assume you are running amiga which should have > 0.5 sec jobs, right? Amiga is about .5 secs and the script that runs (runam3) I think adds another .5 secs (from a quick scan of falkon logs on the actual task run time - but please verify, I think you have all the data from the task log). I suspect, as you and I both agree, that hundreds of short jobs starting in some small interval causes heavy NFS activity. The next round of testing we'll do should start to pick this apart, determine causes and prototype improvements.
- Mike On 3/24/08 11:52 AM, Ioan Raicu wrote: > Not sure if this email made it to the mailing list, due to the larger > size (128KB)... > > Ioan > > ------------------------------------------------------------------------ > > Subject: > Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus... > From: > Ioan Raicu > Date: > Mon, 24 Mar 2008 11:48:16 -0500 > To: > Ben Clifford > > To: > Ben Clifford > CC: > swift-devel > > > .OK, here is my analysis of the plateaus, from Falkon's point of view. > > Notice the per task execution (green) is about 100 seconds per job, > where the job is some invocation of the wrapper.sh that Swift sent to > Falkon. Things look normal so far. See the 2nd graph for more... > > > This shows that there are 600 workers (600 CPUs), which all get their > work within 10 seconds... then they all churn away until about 100 sec > when jobs start completing, and new ones get dispatched. At around 132 > seconds, the wait queue is empty, and some workers start becoming idle > (the red area)... by time 155, the initial 600 jobs that started between > time 0 and 10, have completed, and from 155 to 211, the remaining 400 > jobs all run to completion; they really only start completing around 190 > sec, and all finish by 211. So, the plateau, that is evident here as > well, is really when 400 workers are executing 400 jobs in parallel, and > since the jobs are taking around 100 sec each to complete, the plateau > of 50 seconds is completely normal. See more after the graph... > > > Now the real question is, what is the breakdown of the 100 sec > invocation (108.645 sec on average to be exact), how much is due to > wrapper.sh, and how much is due to the application itself? Mike, can > you comment on this? I assume you are running amiga which should have > 0.5 sec jobs, right? 
> > Ioan > > Ioan Raicu wrote: >> I see the plateau, but there are other graphs which seem to go crazy >> during those periods, such as >> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png >> >> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png >> >> >> Looking at the Falkon logs might reveal more about if the plateau was >> due to Falkon or not. Where would I find the Falkon logs that >> correlate to these graphs? >> >> Ioan >> >> Ben Clifford wrote: >>> you can get plots for your 1000 job run here: >>> >>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ >>> >>> you're hitting the file transfer and file operation limits (that are >>> 20 in your config) once jobs start staging out. >>> >>> There's a wierd looking plateu in graph 'number of execute2 tasks at >>> once:' around 170s .. 200s where no jobs complete for some time. >>> >>> Getting the falkon logs and/or the wrapper (.d) logs would be >>> interesting there. >>> >>> these were generated on my laptop with: >>> >>> make \ >>> LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log clean \ >>> webpage.weights webpage.kara webpage >>> >>> using the SVN log-procesisng code. >>> >> > > -- > =================================================== > Ioan Raicu > Ph.D. Candidate > =================================================== > Distributed Systems Laboratory > Computer Science Department > University of Chicago > 1100 E. 
58th Street, Ryerson Hall > Chicago, IL 60637 > =================================================== > Email: iraicu at cs.uchicago.edu > Web: http://www.cs.uchicago.edu/~iraicu > http://dev.globus.org/wiki/Incubator/Falkon > http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page > =================================================== > =================================================== > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel From iraicu at cs.uchicago.edu Mon Mar 24 12:36:04 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 24 Mar 2008 12:36:04 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E7E30A.2020208@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> Message-ID: <47E7E684.5070303@cs.uchicago.edu> Michael Wilde wrote: > > Now the real question is, what is the breakdown of the 100 sec > > invocation (108.645 sec on average to be exact), how much is due to > > wrapper.sh, and how much is due to the application itself? Mike, can > > you comment on this? I assume you are running amiga which should have > > 0.5 sec jobs, right? > > Amiga is about .5 secs and teh script that runs (runam3) I think adds > another .5 secs (from a quick scan of falkon logs on the actual task > run time - but please verify, I think you have all the data from the > task log). In the log with 1000 tasks, the shortest job was 72 secs, average 108, and max 170 sec. Is amiga working from RAM, or is it from NFS? If it's from NFS, how big is the input data and script? I thought it was about 10KB? The overall throughput was 6.6 jobs/sec, so that is only 66KB/s, which seems quite small, assuming that each read is done in large chunks, and not a few bytes at a time.
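The back-of-envelope numbers in the message above can be checked directly. All inputs in this sketch (1000 jobs, 600 workers, 108.645 s average task time, ~10KB input per job, 6.6 jobs/s overall throughput) are the figures quoted in the thread, not independent measurements:

```python
# Sanity check of the figures quoted in the thread; nothing here is measured,
# all values come from the emails above.
jobs = 1000
workers = 600
avg_task_s = 108.645

# The jobs run in two waves: 600 dispatched immediately, 400 queued behind them.
waves = -(-jobs // workers)             # ceiling division -> 2
ideal_makespan_s = waves * avg_task_s   # ~217 s; the run observed ~211 s

# Aggregate read rate implied by the quoted throughput and assumed input size.
throughput_jobs_s = 6.6
input_kb_per_job = 10
agg_read_kb_s = throughput_jobs_s * input_kb_per_job   # 66 KB/s, as quoted

print(waves, round(ideal_makespan_s, 2), round(agg_read_kb_s, 1))
```

The two-wave makespan (~217 s) lines up with the observed finish at ~211 s, which supports the reading that the plateau is normal tail behaviour rather than a stall.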
> > I suspect, as you and I both agree, that hundreds of short jobs > starting in some small interval causes heavy NFS activity. Yes, but is the NFS activity due to the app, or due to wrapper.sh? I would replace the amiga app with a sleep 0.5, or sleep 1, just to see if the graph looks much different or not. That will surely isolate the overhead from your app or wrapper.sh. Ioan > The next round of testing we'll do should start to pick this apart, > determine causes and prototype improvements. > > - Mike > > > On 3/24/08 11:52 AM, Ioan Raicu wrote: >> Not sure if this email made it to the mailing list, due to the larger >> size (128KB)... >> >> Ioan >> >> ------------------------------------------------------------------------ >> >> Subject: >> Re: [Swift-devel] Re: swift-falkon problem... plots to explain >> plateaus... >> From: >> Ioan Raicu >> Date: >> Mon, 24 Mar 2008 11:48:16 -0500 >> To: >> Ben Clifford >> >> To: >> Ben Clifford >> CC: >> swift-devel >> >> >> .OK, here is my analysis of the plateaus, from Falkon's point of view. >> >> Notice the per task execution (green) is about 100 seconds per job, >> where the job is some invocation of the wrapper.sh that Swift sent to >> Falkon. Things look normal so far. See the 2nd graph for more... >> >> >> This shows that there are 600 workers (600 CPUs), which all get their >> work within 10 seconds... then they all churn away until about 100 >> sec when jobs start completing, and new ones get dispatched. At >> around 132 seconds, the wait queue is empty, and some workers start >> becoming idle (the red area)... by time 155, the initial 600 jobs >> that started between time 0 and 10, have completed, and from 155 to >> 211, the remaining 400 jobs all run to completion; they really only >> start completing around 190 sec, and all finish by 211. 
So, the >> plateau, that is evident here as well, is really when 400 workers are >> executing 400 jobs in parallel, and since the jobs are taking around >> 100 sec each to complete, the plateau of 50 seconds is completely >> normal. See more after the graph... >> >> >> Now the real question is, what is the breakdown of the 100 sec >> invocation (108.645 sec on average to be exact), how much is due to >> wrapper.sh, and how much is due to the application itself? Mike, can >> you comment on this? I assume you are running amiga which should >> have 0.5 sec jobs, right? >> >> Ioan >> >> Ioan Raicu wrote: >>> I see the plateau, but there are other graphs which seem to go crazy >>> during those periods, such as >>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png >>> >>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png >>> >>> >>> Looking at the Falkon logs might reveal more about if the plateau >>> was due to Falkon or not. Where would I find the Falkon logs that >>> correlate to these graphs? >>> >>> Ioan >>> >>> Ben Clifford wrote: >>>> you can get plots for your 1000 job run here: >>>> >>>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ >>>> >>>> you're hitting the file transfer and file operation limits (that >>>> are 20 in your config) once jobs start staging out. >>>> >>>> There's a wierd looking plateu in graph 'number of execute2 tasks >>>> at once:' around 170s .. 200s where no jobs complete for some time. >>>> >>>> Getting the falkon logs and/or the wrapper (.d) logs would be >>>> interesting there. >>>> >>>> these were generated on my laptop with: >>>> >>>> make \ >>>> LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log >>>> clean \ >>>> webpage.weights webpage.kara webpage >>>> >>>> using the SVN log-procesisng code. >>>> >>> >> >> -- >> =================================================== >> Ioan Raicu >> Ph.D. 
Candidate >> =================================================== >> Distributed Systems Laboratory >> Computer Science Department >> University of Chicago >> 1100 E. 58th Street, Ryerson Hall >> Chicago, IL 60637 >> =================================================== >> Email: iraicu at cs.uchicago.edu >> Web: http://www.cs.uchicago.edu/~iraicu >> http://dev.globus.org/wiki/Incubator/Falkon >> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >> =================================================== >> =================================================== >> >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== From wilde at mcs.anl.gov Mon Mar 24 12:59:31 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 12:59:31 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] 
In-Reply-To: <47E7E684.5070303@cs.uchicago.edu> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> Message-ID: <47E7EC03.1090609@mcs.anl.gov> On 3/24/08 12:36 PM, Ioan Raicu wrote: > > > Michael Wilde wrote: >> > Now the real question is, what is the breakdown of the 100 sec >> > invocation (108.645 sec on average to be exact), how much is due to >> > wrapper.sh, and how much is due to the application itself? Mike, can >> > you comment on this? I assume you are running amiga which should have >> > 0.5 sec jobs, right? >> >> Amiga is about .5 secs and teh script that runs (runam3) I think adds >> another .5 secs (from a quick scan of falkon logs on the actual task >> run time - but please verify, I think you have all the data from the >> task log). > The log with 1000 tasks, the shortest job was 72 secs, average 108, and > max 170 sec. Is amiga working from RAM, or is it from NFS? If its from > NFS, how big is the input data and script? I thought it was about > 10KB? The overall throughput was 6.6 jobs/sec, so that is only 66KB/s, > which seems quite small, assuming that each read is done in large > chunks, and not a few bytes at a time. >> >> I suspect, as you and I both agree, that hundreds of short jobs >> starting in some small interval causes heavy NFS activity. > Yes, but is the NFS activity due to the app, or due to wrapper.sh? It's due to both: wrapper.sh fetches the app script from NFS, which in turn fetches the app from NFS. Then wrapper.sh does its setup, which causes more (synchronous) NFS activity; then the app output is copied, then fetched back to the run directory. All this is dominated, I suspect, by NFS request overhead, most of which is not data transfer. There's really nothing to discuss regarding this until I get some data from tests. - Mike > > I would replace the amiga app with a sleep 0.5, or sleep 1, just to see > if the graph looks much different or not.
That will surely isolate the > overhead from your app or wrapper.sh. > > Ioan >> The next round of testing we'll do should start to pick this apart, >> determine causes and prototype improvements. >> >> - Mike >> >> >> On 3/24/08 11:52 AM, Ioan Raicu wrote: >>> Not sure if this email made it to the mailing list, due to the larger >>> size (128KB)... >>> >>> Ioan >>> >>> ------------------------------------------------------------------------ >>> >>> Subject: >>> Re: [Swift-devel] Re: swift-falkon problem... plots to explain >>> plateaus... >>> From: >>> Ioan Raicu >>> Date: >>> Mon, 24 Mar 2008 11:48:16 -0500 >>> To: >>> Ben Clifford >>> >>> To: >>> Ben Clifford >>> CC: >>> swift-devel >>> >>> >>> .OK, here is my analysis of the plateaus, from Falkon's point of view. >>> >>> Notice the per task execution (green) is about 100 seconds per job, >>> where the job is some invocation of the wrapper.sh that Swift sent to >>> Falkon. Things look normal so far. See the 2nd graph for more... >>> >>> >>> This shows that there are 600 workers (600 CPUs), which all get their >>> work within 10 seconds... then they all churn away until about 100 >>> sec when jobs start completing, and new ones get dispatched. At >>> around 132 seconds, the wait queue is empty, and some workers start >>> becoming idle (the red area)... by time 155, the initial 600 jobs >>> that started between time 0 and 10, have completed, and from 155 to >>> 211, the remaining 400 jobs all run to completion; they really only >>> start completing around 190 sec, and all finish by 211. So, the >>> plateau, that is evident here as well, is really when 400 workers are >>> executing 400 jobs in parallel, and since the jobs are taking around >>> 100 sec each to complete, the plateau of 50 seconds is completely >>> normal. See more after the graph... 
>>> >>> >>> Now the real question is, what is the breakdown of the 100 sec >>> invocation (108.645 sec on average to be exact), how much is due to >>> wrapper.sh, and how much is due to the application itself? Mike, can >>> you comment on this? I assume you are running amiga which should >>> have 0.5 sec jobs, right? >>> >>> Ioan >>> >>> Ioan Raicu wrote: >>>> I see the plateau, but there are other graphs which seem to go crazy >>>> during those periods, such as >>>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png >>>> >>>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png >>>> >>>> >>>> Looking at the Falkon logs might reveal more about if the plateau >>>> was due to Falkon or not. Where would I find the Falkon logs that >>>> correlate to these graphs? >>>> >>>> Ioan >>>> >>>> Ben Clifford wrote: >>>>> you can get plots for your 1000 job run here: >>>>> >>>>> http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/ >>>>> >>>>> you're hitting the file transfer and file operation limits (that >>>>> are 20 in your config) once jobs start staging out. >>>>> >>>>> There's a wierd looking plateu in graph 'number of execute2 tasks >>>>> at once:' around 170s .. 200s where no jobs complete for some time. >>>>> >>>>> Getting the falkon logs and/or the wrapper (.d) logs would be >>>>> interesting there. >>>>> >>>>> these were generated on my laptop with: >>>>> >>>>> make \ >>>>> LOG=/Users/benc/work/everylog/amps1-20080323-1935-su38n0k5.log >>>>> clean \ >>>>> webpage.weights webpage.kara webpage >>>>> >>>>> using the SVN log-procesisng code. >>>>> >>>> >>> >>> -- >>> =================================================== >>> Ioan Raicu >>> Ph.D. Candidate >>> =================================================== >>> Distributed Systems Laboratory >>> Computer Science Department >>> University of Chicago >>> 1100 E. 
58th Street, Ryerson Hall >>> Chicago, IL 60637 >>> =================================================== >>> Email: iraicu at cs.uchicago.edu >>> Web: http://www.cs.uchicago.edu/~iraicu >>> http://dev.globus.org/wiki/Incubator/Falkon >>> http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page >>> =================================================== >>> =================================================== >>> >>> >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > From hategan at mcs.anl.gov Mon Mar 24 13:03:12 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 24 Mar 2008 13:03:12 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E7EC03.1090609@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> Message-ID: <1206381792.11561.1.camel@blabla.mcs.anl.gov> > Its due to both. wrapper.sh fetches the app script from nfs which > fetches the app from nfs. then wrapper.sh does its setup, which causes > more (synchronous) nfs activity, then the app output is copied, then > fetched back to the run directory. > > All this is dominated I suspect by nfs request overhead, most of which > is not data transfer. > > There's really nothing to discuss regarding this until I get some data > from tests. As far as I can remember, Ben added fairly comprehensive logging to the wrapper. That may shed some light on the issue. 
Mihael From benc at hawaga.org.uk Mon Mar 24 15:16:13 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 20:16:13 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem In-Reply-To: <47E7B27D.6050606@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> Message-ID: On Mon, 24 Mar 2008, Ioan Raicu wrote: > I see the plateau, but there are other graphs which seem to go crazy during > those periods, such as > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_TRANSFER-total.png > http://www.ci.uchicago.edu/~benc/report-amps1-20080323-1935-su38n0k5/karatasks.FILE_OPERATION-total.png > > Looking at the Falkon logs might reveal more about if the plateau was due to > Falkon or not. Where would I find the Falkon logs that correlate to these > graphs? I haven't looked at the falkon logs because I don't have them, but that load corresponds roughly with a bunch of jobs apparently finishing and their results being staged out. As far as I can tell, the workflow looks like minimal stage-in (so not much FILE_TRANSFER load), then a bunch of jobs that take around 120s; those all start finishing around t=120, so now swift is doing lots of file transfer. -- From benc at hawaga.org.uk Mon Mar 24 15:24:13 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 20:24:13 +0000 (GMT) Subject: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...
In-Reply-To: <47E7DB50.2020407@cs.uchicago.edu> References: <47DDBFD7.2050700@mcs.anl.gov> <47DECC42.4020500@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> <47E7DB50.2020407@cs.uchicago.edu> Message-ID: On Mon, 24 Mar 2008, Ioan Raicu wrote: > start completing, and new ones get dispatched. At around 132 seconds, the > wait queue is empty, and some workers start becoming idle (the red area)... by Ideally swift would be keeping the queue full until there is nothing left to send. There shouldn't be two distinct 600 and 400 job bursts. But I guess that may be because the job throttling isn't set to a large enough infinity. > Now the real question is, what is the breakdown of the 100 sec invocation > (108.645 sec on average to be exact), how much is due to wrapper.sh, and how > much is due to the application itself? Mike, can you comment on this? I > assume you are running amiga which should have 0.5 sec jobs, right? running with wrapperlog.always.transfer=true will grab the raw data for this. -- From wilde at mcs.anl.gov Mon Mar 24 17:13:01 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 17:13:01 -0500 Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: References: <47E0422D.9000707@mcs.anl.gov> Message-ID: <47E8276D.1020809@mcs.anl.gov> I just updated to 1756 and I still get same prompt. This is not a big deal, just thought you'd want to know. My command was: ant -Dwith-provider-deef redist in cog/modules/vdsk - Mike dist.dir.warning: [input] [input] ====================================================================================== [input] Warning! The specified target directory (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) does not seem to contain a Swift build. 
[input] Press Return to continue with the build or CTRL+C to abort... [input] ====================================================================================== [input] On 3/18/08 8:40 PM, Ben Clifford wrote: >> [input] Warning! The specified target directory >> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) >> does not seem to contain a Swift build. >> [input] Press Return to continue with the build or CTRL+C to abort... > > > As of r1738 (to provider-deef) this does not happen any more. > From wilde at mcs.anl.gov Mon Mar 24 17:14:10 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 17:14:10 -0500 Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: <47E8276D.1020809@mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> <47E8276D.1020809@mcs.anl.gov> Message-ID: <47E827B2.1090802@mcs.anl.gov> correction: 1759 On 3/24/08 5:13 PM, Michael Wilde wrote: > I just updated to 1756 and I still get same prompt. This is not a big > deal, just thought you'd want to know. > > My command was: > > ant -Dwith-provider-deef redist > > in cog/modules/vdsk > > - Mike > > > dist.dir.warning: > [input] > [input] > ====================================================================================== > > [input] Warning! The specified target directory > (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > does not seem to contain a Swift build. > [input] Press Return to continue with the build or CTRL+C to abort... > [input] > ====================================================================================== > > [input] > > > On 3/18/08 8:40 PM, Ben Clifford wrote: >>> [input] Warning! The specified target directory >>> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) >>> >>> does not seem to contain a Swift build. >>> [input] Press Return to continue with the build or CTRL+C to >>> abort... 
>> >> >> As of r1738 (to provider-deef) this does not happen any more. >> > From hategan at mcs.anl.gov Mon Mar 24 17:20:14 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 24 Mar 2008 17:20:14 -0500 Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: <47E8276D.1020809@mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> <47E8276D.1020809@mcs.anl.gov> Message-ID: <1206397214.25801.1.camel@blabla.mcs.anl.gov> Seems quite obvious, given that the warning was meant to deal with a situation where one would build provider-deef after building swift, whereas the -Dwith-provider-swift trick does precisely the opposite. On Mon, 2008-03-24 at 17:13 -0500, Michael Wilde wrote: > I just updated to 1756 and I still get same prompt. This is not a big > deal, just thought you'd want to know. > > My command was: > > ant -Dwith-provider-deef redist > > in cog/modules/vdsk > > - Mike > > > dist.dir.warning: > [input] > [input] > ====================================================================================== > [input] Warning! The specified target directory > (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > does not seem to contain a Swift build. > [input] Press Return to continue with the build or CTRL+C to abort... > [input] > ====================================================================================== > [input] > > > On 3/18/08 8:40 PM, Ben Clifford wrote: > >> [input] Warning! The specified target directory > >> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > >> does not seem to contain a Swift build. > >> [input] Press Return to continue with the build or CTRL+C to abort... > > > > > > As of r1738 (to provider-deef) this does not happen any more. 
> > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Mar 24 17:22:30 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 22:22:30 +0000 (GMT) Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: <47E8276D.1020809@mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> <47E8276D.1020809@mcs.anl.gov> Message-ID: can you type svn info in cog/modules/provider-deef ? On Mon, 24 Mar 2008, Michael Wilde wrote: > I just updated to 1756 and I still get same prompt. This is not a big deal, > just thought you'd want to know. > > My command was: > > ant -Dwith-provider-deef redist > > in cog/modules/vdsk > > - Mike > > > dist.dir.warning: > [input] > [input] > ====================================================================================== > [input] Warning! The specified target directory > (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > does not seem to contain a Swift build. > [input] Press Return to continue with the build or CTRL+C to abort... > [input] > ====================================================================================== > [input] > > > On 3/18/08 8:40 PM, Ben Clifford wrote: > > > [input] Warning! The specified target directory > > > (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > > > does not seem to contain a Swift build. > > > [input] Press Return to continue with the build or CTRL+C to abort... > > > > > > As of r1738 (to provider-deef) this does not happen any more. > > > > From hategan at mcs.anl.gov Mon Mar 24 17:22:59 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 24 Mar 2008 17:22:59 -0500 Subject: [Swift-devel] Why build prompts in redist? 
In-Reply-To: <1206397214.25801.1.camel@blabla.mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> <47E8276D.1020809@mcs.anl.gov> <1206397214.25801.1.camel@blabla.mcs.anl.gov> Message-ID: <1206397379.26235.0.camel@blabla.mcs.anl.gov> However, given that Ben removed that warning entirely, I'm led to believe that your provider-deef isn't up to date. On Mon, 2008-03-24 at 17:20 -0500, Mihael Hategan wrote: > Seems quite obvious, given that the warning was meant to deal with a > situation where one would build provider-deef after building swift, > whereas the -Dwith-provider-swift trick does precisely the opposite. > > On Mon, 2008-03-24 at 17:13 -0500, Michael Wilde wrote: > > I just updated to 1756 and I still get same prompt. This is not a big > > deal, just thought you'd want to know. > > > > My command was: > > > > ant -Dwith-provider-deef redist > > > > in cog/modules/vdsk > > > > - Mike > > > > > > dist.dir.warning: > > [input] > > [input] > > ====================================================================================== > > [input] Warning! The specified target directory > > (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > > does not seem to contain a Swift build. > > [input] Press Return to continue with the build or CTRL+C to abort... > > [input] > > ====================================================================================== > > [input] > > > > > > On 3/18/08 8:40 PM, Ben Clifford wrote: > > >> [input] Warning! The specified target directory > > >> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) > > >> does not seem to contain a Swift build. > > >> [input] Press Return to continue with the build or CTRL+C to abort... > > > > > > > > > As of r1738 (to provider-deef) this does not happen any more.
> > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel > From benc at hawaga.org.uk Mon Mar 24 18:36:45 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 24 Mar 2008 23:36:45 +0000 (GMT) Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <1206381792.11561.1.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> Message-ID: On Mon, 24 Mar 2008, Mihael Hategan wrote: > As far as I can remember, Ben added fairly comprehensive logging to the > wrapper. That may shed some light on the issue. Indeed I did; and that logging information can be sent back to the submit host by enabling wrapperlog.always.transfer=true -- From wilde at mcs.anl.gov Mon Mar 24 19:30:14 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 19:30:14 -0500 Subject: [Swift-devel] Why build prompts in redist? In-Reply-To: <1206397379.26235.0.camel@blabla.mcs.anl.gov> References: <47E0422D.9000707@mcs.anl.gov> <47E8276D.1020809@mcs.anl.gov> <1206397214.25801.1.camel@blabla.mcs.anl.gov> <1206397379.26235.0.camel@blabla.mcs.anl.gov> Message-ID: <47E84796.30307@mcs.anl.gov> It was - thanks. I missed Ben's initial note "As of r1738 (to provider-deef)". That fixed it. - Mike On 3/24/08 5:22 PM, Mihael Hategan wrote: > However, given that Ben removed that warning entirely, causes me to > believe that your provider-deef isn't up to date. 
> > On Mon, 2008-03-24 at 17:20 -0500, Mihael Hategan wrote: >> Seems quite obvious, given that the warning was meant to deal with a >> situation where one would build provider-deef after building swift, >> whereas the -Dwith-provider-swift trick does precisely the opposite. >> >> On Mon, 2008-03-24 at 17:13 -0500, Michael Wilde wrote: >>> I just updated to 1756 and I still get same prompt. This is not a big >>> deal, just thought you'd want to know. >>> >>> My command was: >>> >>> ant -Dwith-provider-deef redist >>> >>> in cog/modules/vdsk >>> >>> - Mike >>> >>> >>> dist.dir.warning: >>> [input] >>> [input] >>> ====================================================================================== >>> [input] Warning! The specified target directory >>> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) >>> does not seem to contain a Swift build. >>> [input] Press Return to continue with the build or CTRL+C to abort... >>> [input] >>> ====================================================================================== >>> [input] >>> >>> >>> On 3/18/08 8:40 PM, Ben Clifford wrote: >>>>> [input] Warning! The specified target directory >>>>> (/radix-homes01/wilde/swift/src/cog/modules/vdsk/../..//modules/vdsk/dist/vdsk-svn) >>>>> does not seem to contain a Swift build. >>>>> [input] Press Return to continue with the build or CTRL+C to abort... >>>> >>>> As of r1738 (to provider-deef) this does not happen any more. 
>>>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >>> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel >> > > From wilde at mcs.anl.gov Mon Mar 24 22:15:38 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Mon, 24 Mar 2008 22:15:38 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> Message-ID: <47E86E5A.6080303@mcs.anl.gov> Ben, do you have a script to sum the time spent per step of wrapper.sh, over a set of -info files? On 3/24/08 6:36 PM, Ben Clifford wrote: > On Mon, 24 Mar 2008, Mihael Hategan wrote: > >> As far as I can remember, Ben added fairly comprehensive logging to the >> wrapper. That may shed some light on the issue. > > Indeed I did; and that logging information can be sent back to the submit > host by enabling wrapperlog.always.transfer=true > From iraicu at cs.uchicago.edu Mon Mar 24 22:21:39 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 24 Mar 2008 22:21:39 -0500 Subject: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...
In-Reply-To: References: <47DDBFD7.2050700@mcs.anl.gov> <47DED24B.8070900@mcs.anl.gov> <47DFCC33.30803@mcs.anl.gov> <47E1352E.3070808@mcs.anl.gov> <47E15F6A.5080109@mcs.anl.gov> <47E3A613.2040701@mcs.anl.gov> <1206108150.5100.0.camel@blabla.mcs.anl.gov> <47E718F8.6010800@mcs.anl.gov> <47E7B27D.6050606@cs.uchicago.edu> <47E7DB50.2020407@cs.uchicago.edu> Message-ID: <47E86FC3.80904@cs.uchicago.edu> Ben Clifford wrote: > On Mon, 24 Mar 2008, Ioan Raicu wrote: > > >> start completing, and new ones get dispatched. At around 132 seconds, the >> wait queue is empty, and some workers start becoming idle (the red area)... by >> > > Ideally swift would be keeping the queue full until there is nothing left > to send. There shouldn't be two distinct 600 and 400 job bursts. But I > guess that may be because the job throttling isn't set to a large enough > infinity. > Its not the throttling... Swift sent all 1000 tasks at once (all within the first 10 seconds). There were 600 workers running on 600 CPUs, so 600 (of the 1000) tasks went from the wait queue to the running state, and there were 400 tasks left in the wait queue. After some time, the first round of tasks (the first 600) completed and the second round of tasks (400) went from the wait queue to the running state. So, the two distinct rounds, 600 then 400 is because of the 600 CPUs and 1000 total tasks... Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 
58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Tue Mar 25 00:28:44 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Mar 2008 00:28:44 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E86E5A.6080303@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> Message-ID: <47E88D8C.4090207@mcs.anl.gov> I eyeballed the wrapperlogs to get a rough idea of what was happening. I ran with wrapperlog saving and no other changes for wf's of 10, 100 and 500 jobs, to see how the exec time grew. At 500 jobs it grew to about 30+ seconds for a core app exec time of about 1 sec. (I'm just recollecting the times as at this point I didn't write much down). First results showed more time spent in the app wrapper than in wrapper.sh. I remedied this by using /tmp as the app-wrapper's working dir, and caching the app binary on /tmp. This brought a 20+ sec app exec time down to about 3 seconds. With this fixed, the total time in wrapper.sh including the app is now about 15 seconds, with 3 being in the app-wrapper itself. The time seems about evenly spread over the several wrapper.sh operations, which is not surprising when 500 wrappers hit NFS all at once.
I then tried 3 more tests: - a run to see if the app-executable caching on /tmp had an effect (it didn't) - a run to see if turning off wrapperlog retrieval had an effect - a run with data operation throttles (both) set to 100 from 10 None of these last three things had a significant effect. Tomorrow I will try some mods to the wrapper script. Turning off wrapper logging in a previous trial yesterday *seemed* to shave 20-30% off the run time. I need to verify this. I'm also going to try to use /tmp for the jobdir and reduce wrapper.sh overhead; also will leave the (tiny) job output on /tmp for later aggregation (will have some swift questions on that). Ben, if you want to look at any of these logs, the runs are in swift-logs/wilde in the order described above (w/comment files): 346: 10 job workflow 347: 100 job wf 348: 500 job wf 349: 500 jobs w/ improved app-wrapper 350: 500 jobs w/ improved app-wrapper & executable on /tmp 351: 500 jobs, wrapperlog saving off 352: 500 jobs, wrapperlog saving off, data throttles at 100 (from 20) All but the first of these should have falkon logs saved as well. I have several ideas on how to proceed, but welcome advice and any discoveries from log analysis. Thanks, Mike On 3/24/08 10:15 PM, Michael Wilde wrote: > Ben, do you have a script to sum the time spent per step of wrapper.sh, > over a set of -info files? > > On 3/24/08 6:36 PM, Ben Clifford wrote: >> On Mon, 24 Mar 2008, Mihael Hategan wrote: >> >>> As far as I can remember, Ben added fairly comprehensive logging to the >>> wrapper. That may shed some light on the issue. >> >> Indeed I did; and that logging information can be sent back to the >> submit host by enabling wrapperlog.always.transfer=true >> > From hategan at mcs.anl.gov Tue Mar 25 03:31:19 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 03:31:19 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...]
In-Reply-To: <47E88D8C.4090207@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> Message-ID: <1206433879.26701.0.camel@blabla.mcs.anl.gov> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: > I eyeballed the wrapperlogs to get a rough idea of what was happening. > > I ran with wrapperlog saving and no other changes for wf's of 10, 100 > and 500 jobs, to see how the exec time grew. At 500 jobs it grew to > about 30+ seconds for a core app exec time of about 1 sec. (Im just > recollecting the times as at this point I didnt write much down). > I would personally like to see those logs. From wilde at mcs.anl.gov Tue Mar 25 08:16:07 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Mar 2008 08:16:07 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <1206433879.26701.0.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> Message-ID: <47E8FB17.1080501@mcs.anl.gov> On 3/25/08 3:31 AM, Mihael Hategan wrote: > On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: >> I eyeballed the wrapperlogs to get a rough idea of what was happening. >> >> I ran with wrapperlog saving and no other changes for wf's of 10, 100 >> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to >> about 30+ seconds for a core app exec time of about 1 sec. (Im just >> recollecting the times as at this point I didnt write much down). >> > > I would personally like to see those logs. 
I listed all the runs in the previous mail (below), Mihael. They are on CI NFS at ~benc/swift-logs/wilde/run{345-350}. Let us know what you find. Thanks, - Mike On 3/25/08 12:28 AM, Michael Wilde wrote: > I eyeballed the wrapperlogs to get a rough idea of what was happening. > > I ran with wrapperlog saving and no other changes for wf's of 10, 100 > and 500 jobs, to see how the exec time grew. At 500 jobs it grew to > about 30+ seconds for a core app exec time of about 1 sec. (Im just > recollecting the times as at this point I didnt write much down). > > First results showed more time spent in the app wrapper than in > wrapper.sh. I remedied this by using /tmp as the app-wrapper's working > dir, and caching the app binary on /tmp. This brought a 20+ sec app > exec time down to about 3 seconds. > > With this fixed, the total time in wrapper.sh including the app is now > about 15 seconds, with 3 being in the app-wrapper itself. The time seems > about evenly spread over the several wrapper.sh operations, which is not > surprising when 500 wrappers hit NFS all at once. > > I then tried 3 more tests: > - a run to see if the app-executable caching on /tmp had an effect > (it didnt) > - a run to see if turning of wrapperlog retrieval had an effect > - a run with data operation throttles (both) set to 100 from 10 > > None of these last three things had a significant effect. > > Tomorrow I will try some mods to the wrapper script. Turning off wrapper > logging in a previous trial yesterday *seemed* to shave 20-30% off the > run time. I need to verify this. > > I'm also going to try to use /tmp for the jobdir and reduce wrapper.sh > overhead; also will leave the (tiny) job output on /tmp for later > aggregation (will have some swift questions on that). 
> > Ben, if you want to look at any of these logs, the runs are in > swift-logs/wilde in the order described above (w/comment files): > > 346: 10 job workflow > 347: 100 job wf > 348: 500 job wf > 349: 500 jobs w/ improved app-wrapper > 350: 500 jobs w/ improved app-wrapper & executable on /tmp > 351: 500 jobs, wrapperlog saving off > 352: 500 jobs, wrapperlog saving off, data throttles at 100 (from 20) > > All but the first of these should have falkon logs saved as well. > > I have several ideas on how to proceed, but welcome advice and any > discoveries from log analysis. > > Thanks, > > Mike > > > On 3/24/08 10:15 PM, Michael Wilde wrote: >> Ben, do you have a script to sum the time spent per step of >> wrapper.sh, over a set in -info files? >> >> On 3/24/08 6:36 PM, Ben Clifford wrote: >>> On Mon, 24 Mar 2008, Mihael Hategan wrote: >>> >>>> As far as I can remember, Ben added fairly comprehensive logging to the >>>> wrapper. That may shed some light on the issue. >>> >>> Indeed I did; and that logging information can be sent back to the >>> submit host by enabling wrapperlog.always.transfer=true >>> >> > From benc at hawaga.org.uk Tue Mar 25 08:22:27 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 25 Mar 2008 13:22:27 +0000 (GMT) Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E86E5A.6080303@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> Message-ID: On Mon, 24 Mar 2008, Michael Wilde wrote: > Ben, do you have a script to sum the time spent per step of wrapper.sh, over a > set in -info files? No. But possibly a similar summary can be given visually by the -info graphs that are in the present log procesing code. Make your wrapper logs online and we can look at them. 
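[Editor's note: no such summing script exists in the thread; below is a hypothetical sketch of one. It ASSUMES each -info line looks like "<epoch-seconds> <EVENT_NAME>" — the real wrapper.sh log format may differ, so the awk field extraction would need adjusting. The sample files are fabricated stand-ins for real -info logs.]

```shell
# Build two sample -info files so the sketch is self-contained.
mkdir -p /tmp/info-demo
printf '100 LOG_START\n105 CREATE_JOBDIR\n107 CREATE_INPUTDIR\n110 EXECUTE_DONE\n' > /tmp/info-demo/job1-info
printf '200 LOG_START\n204 CREATE_JOBDIR\n207 CREATE_INPUTDIR\n209 EXECUTE_DONE\n' > /tmp/info-demo/job2-info

# Credit the gap between consecutive events to the earlier event (the
# step that was running), then total each step across all files.
awk 'FNR == 1 { t = ""; e = "" }
     { if (t != "") total[e] += $1 - t; t = $1; e = $2 }
     END { for (s in total) printf "%-20s %ds\n", s, total[s] }' /tmp/info-demo/*-info
```

Pointed at a directory of retrieved wrapper logs, the largest totals would indicate which wrapper.sh steps dominate.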
-- From hategan at mcs.anl.gov Tue Mar 25 08:34:43 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 08:34:43 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E8FB17.1080501@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> Message-ID: <1206452083.31476.12.camel@blabla.mcs.anl.gov> On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: > On 3/25/08 3:31 AM, Mihael Hategan wrote: > > On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: > >> I eyeballed the wrapperlogs to get a rough idea of what was happening. > >> > >> I ran with wrapperlog saving and no other changes for wf's of 10, 100 > >> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to > >> about 30+ seconds for a core app exec time of about 1 sec. (Im just > >> recollecting the times as at this point I didnt write much down). > >> > > > > I would personally like to see those logs. > > I listed all the runs in the previous mail (below), Mihael. They are on > CI NFS at ~benc/swift-logs/wilde/run{345-350}. Sorry about that. > Let us know what you find. > It looks like this: - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: mkdir -p $WFDIR/info/$JOBDIR mkdir -p $WFDIR/status/$JOBDIR and the creation of the info file. - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: mkdir -p $DIR (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, which seems to roughly fit the observed numbers). - 3.5 seconds for COPYING_OUTPUTS - 2.5 seconds for RM_JOBDIR I'd be curious to know how much of the time is actually spent writing to the logs. 
That's because I see one second between EXECUTE_DONE and COPYING_OUTPUTS, a place where the only meaningful things that are done are two log messages. Perhaps it may be useful to run the whole thing through strace -T. Mihael From wilde at mcs.anl.gov Tue Mar 25 08:44:40 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Mar 2008 08:44:40 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <1206452083.31476.12.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> Message-ID: <47E901C8.1060106@mcs.anl.gov> I did runs the day before with a modified wrapper that bypassed the INFO logging. It saved a good amount - I recall about 30% but need to re-check the numbers. Yes, I came to the same conclusion on the mkdirs. I'm looking at reducing these, likely moving the jobdir to /tmp. I think I can do that within the current structure. wrapper.sh is very clear and nicely written. (Ben: yes, eyeballing the log #s was easy and no problem). First thing I want to do, though, is run some large scale tests on our two science workflows, increasing the petro-modelling one (the sub-second application) to a larger runtime through app-level batching. Zhao's latest tests indicate that if we do batches of 40, bringing the jobs from .5 sec to 20 sec, we can saturate the BGP's 4K cores and keep it running efficiently. Given the extra wrapper.sh overhead, I might need to increase that another 10X, but once the app is wrapped in a loop, it makes little difference to the user how big we make that.
Once we get those running nicely at a larger, less brutal job time, I'll come back to wrapper.sh tuning. If you or Ben want to do this in the meantime, though, that would be great. We have the use-local-disk scenario on our development stack anyways - this would be a good time to do it. If I do it, it will be only a prototype for measurement purposes. Mike On 3/25/08 8:34 AM, Mihael Hategan wrote: > On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: >> On 3/25/08 3:31 AM, Mihael Hategan wrote: >>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: >>>> I eyeballed the wrapperlogs to get a rough idea of what was happening. >>>> >>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 >>>> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to >>>> about 30+ seconds for a core app exec time of about 1 sec. (Im just >>>> recollecting the times as at this point I didnt write much down). >>>> >>> I would personally like to see those logs. >> I listed all the runs in the previous mail (below), Mihael. They are on >> CI NFS at ~benc/swift-logs/wilde/run{345-350}. > > Sorry about that. > >> Let us know what you find. >> > > It looks like this: > - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: > mkdir -p $WFDIR/info/$JOBDIR > mkdir -p $WFDIR/status/$JOBDIR > and the creation of the info file. > - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: > mkdir -p $DIR > (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, > which seems to roughly fit the observed numbers). > - 3.5 seconds for COPYING_OUTPUTS > - 2.5 seconds for RM_JOBDIR > > I'd be curious to know how much of the time is actually spent writing to > the logs. That's because I see one second between EXECUTE_DONE and > COPYING_OUTPUTS, a place where the only meaningful things that are done > are two log messages. > > Perhaps it may be useful to run the whole thing through strace -T. 
> > Mihael > > From hategan at mcs.anl.gov Tue Mar 25 09:32:54 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 09:32:54 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E901C8.1060106@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> <47E901C8.1060106@mcs.anl.gov> Message-ID: <1206455574.31476.15.camel@blabla.mcs.anl.gov> Problem may be that, as a quick test shows, bash opens and closes the info file every time a redirect is done. On Tue, 2008-03-25 at 08:44 -0500, Michael Wilde wrote: > I did runs the day before with a modified wrapper that bypassed the INFO > logging. It saved a good amount - I recall about 30% but need to > re-check the numbers. > > Yes, I came to the same conclusion on the mkdirs. Im looking at > reducing these, likely moving the jobdir to /tmp. I think I can do that > within the current structure. wrapper.sh is ver clear and nicely > written. (Ben: yes, eyeballing the log #s was easy and no problem). > > First thing I want to do, though, is run some large scale tests on our > two science workflows, increasing the petro-modelling one (the > sub-second application) to a larger runtime through app-level batching. > > Zhao's latest test indicate that if we do batches of 40, bringing the > jobs from .5 sec to 20 sec, we can saturate the BGP's 4K cores and keep > it running efficiently. Given the extra wrapper.sh overhead, I might > need to increase that another 10X, but once the app is wrapped in a > loop, it makes little difference to the user how big we make that. 
> > The other app is a molecule-docking app, that can be batched similarly. > > Once we get those running nicely at a larger, less brutal job time, I'll > come back to wrapper.sh tuning. If you or Ben want to do this in the > meantime, though, that would be great. We have the use-local-disk > scenario on our development stack anyways - this would be a good time to > do it. If I do it, it will be only a prototype for measurement purposes. > > Mike > > > > > On 3/25/08 8:34 AM, Mihael Hategan wrote: > > On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: > >> On 3/25/08 3:31 AM, Mihael Hategan wrote: > >>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: > >>>> I eyeballed the wrapperlogs to get a rough idea of what was happening. > >>>> > >>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 > >>>> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to > >>>> about 30+ seconds for a core app exec time of about 1 sec. (Im just > >>>> recollecting the times as at this point I didnt write much down). > >>>> > >>> I would personally like to see those logs. > >> I listed all the runs in the previous mail (below), Mihael. They are on > >> CI NFS at ~benc/swift-logs/wilde/run{345-350}. > > > > Sorry about that. > > > >> Let us know what you find. > >> > > > > It looks like this: > > - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: > > mkdir -p $WFDIR/info/$JOBDIR > > mkdir -p $WFDIR/status/$JOBDIR > > and the creation of the info file. > > - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: > > mkdir -p $DIR > > (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, > > which seems to roughly fit the observed numbers). > > - 3.5 seconds for COPYING_OUTPUTS > > - 2.5 seconds for RM_JOBDIR > > > > I'd be curious to know how much of the time is actually spent writing to > > the logs. 
That's because I see one second between EXECUTE_DONE and > > COPYING_OUTPUTS, a place where the only meaningful things that are done > > are two log messages. > > > > Perhaps it may be useful to run the whole thing through strace -T. > > > > Mihael > > > > > From wilde at mcs.anl.gov Tue Mar 25 10:06:46 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Mar 2008 10:06:46 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <1206455574.31476.15.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> <47E901C8.1060106@mcs.anl.gov> <1206455574.31476.15.camel@blabla.mcs.anl.gov> Message-ID: <47E91506.2080100@mcs.anl.gov> One thing I'll test is generating the info file on /tmp, and moving it when done to the final job dir. I can see adjusting wrapper.sh to go from very light to very logged with a few increments in the middle that would be most useful. The main option I think we want to leave for users to toggle in common usage, is whether to run the app with its jobdir on local disk, typically below /tmp, or on shared disk. The user would decide based on the job's I/O profile and on local disk space availability. Also, I recall some discussion on the success file. Thats acceptable overhead for all but the tiniest of jobs, but when a BGP is eventually running 100K+ short jobs at once, the rate of success file creation could become a bottleneck. Seems like we could have an option that avoids creating and expecting the success file if that proved useful - need to measure. 
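[Editor's note: the "generate the info file on /tmp, move it when done" idea above can be sketched as follows. All names here — WFDIR, JOBDIR, ID — are placeholders, not wrapper.sh's actual variables, and the shared directory is simulated under /tmp so the sketch is self-contained.]

```shell
WFDIR=/tmp/demo-shared-wf        # stands in for the shared (NFS) workflow dir
JOBDIR=a1
ID=job-0001
mkdir -p "$WFDIR/info/$JOBDIR"

INFO=$(mktemp "/tmp/${ID}-info.XXXXXX")      # info file lives on local disk

log() { echo "$(date +%s) $*" >> "$INFO"; }  # each write hits /tmp, not NFS

log LOG_START
log EXECUTE_DONE

# One shared-filesystem operation at the end, instead of one per log line.
mv "$INFO" "$WFDIR/info/$JOBDIR/${ID}-info"
```

The trade-off is that if the node dies mid-job, the partial log is stranded on local disk instead of being visible on the shared filesystem.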
- Mike On 3/25/08 9:32 AM, Mihael Hategan wrote: > Problem may be that, as a quick test shows, bash opens and closes the > info file every time a redirect is done. > > On Tue, 2008-03-25 at 08:44 -0500, Michael Wilde wrote: >> I did runs the day before with a modified wrapper that bypassed the INFO >> logging. It saved a good amount - I recall about 30% but need to >> re-check the numbers. >> >> Yes, I came to the same conclusion on the mkdirs. Im looking at >> reducing these, likely moving the jobdir to /tmp. I think I can do that >> within the current structure. wrapper.sh is ver clear and nicely >> written. (Ben: yes, eyeballing the log #s was easy and no problem). >> >> First thing I want to do, though, is run some large scale tests on our >> two science workflows, increasing the petro-modelling one (the >> sub-second application) to a larger runtime through app-level batching. >> >> Zhao's latest test indicate that if we do batches of 40, bringing the >> jobs from .5 sec to 20 sec, we can saturate the BGP's 4K cores and keep >> it running efficiently. Given the extra wrapper.sh overhead, I might >> need to increase that another 10X, but once the app is wrapped in a >> loop, it makes little difference to the user how big we make that. >> >> The other app is a molecule-docking app, that can be batched similarly. >> >> Once we get those running nicely at a larger, less brutal job time, I'll >> come back to wrapper.sh tuning. If you or Ben want to do this in the >> meantime, though, that would be great. We have the use-local-disk >> scenario on our development stack anyways - this would be a good time to >> do it. If I do it, it will be only a prototype for measurement purposes. 
>> >> Mike >> >> >> >> >> On 3/25/08 8:34 AM, Mihael Hategan wrote: >>> On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: >>>> On 3/25/08 3:31 AM, Mihael Hategan wrote: >>>>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: >>>>>> I eyeballed the wrapperlogs to get a rough idea of what was happening. >>>>>> >>>>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 >>>>>> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to >>>>>> about 30+ seconds for a core app exec time of about 1 sec. (Im just >>>>>> recollecting the times as at this point I didnt write much down). >>>>>> >>>>> I would personally like to see those logs. >>>> I listed all the runs in the previous mail (below), Mihael. They are on >>>> CI NFS at ~benc/swift-logs/wilde/run{345-350}. >>> Sorry about that. >>> >>>> Let us know what you find. >>>> >>> It looks like this: >>> - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: >>> mkdir -p $WFDIR/info/$JOBDIR >>> mkdir -p $WFDIR/status/$JOBDIR >>> and the creation of the info file. >>> - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: >>> mkdir -p $DIR >>> (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, >>> which seems to roughly fit the observed numbers). >>> - 3.5 seconds for COPYING_OUTPUTS >>> - 2.5 seconds for RM_JOBDIR >>> >>> I'd be curious to know how much of the time is actually spent writing to >>> the logs. That's because I see one second between EXECUTE_DONE and >>> COPYING_OUTPUTS, a place where the only meaningful things that are done >>> are two log messages. >>> >>> Perhaps it may be useful to run the whole thing through strace -T. >>> >>> Mihael >>> >>> > > From hategan at mcs.anl.gov Tue Mar 25 10:09:06 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 10:09:06 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] 
In-Reply-To: <47E91506.2080100@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> <47E901C8.1060106@mcs.anl.gov> <1206455574.31476.15.camel@blabla.mcs.anl.gov> <47E91506.2080100@mcs.anl.gov> Message-ID: <1206457746.20249.1.camel@blabla.mcs.anl.gov> I just wrote a version of the wrapper that opens the log in a descriptor (so opening happens once). I need to test it first, but I'll commit shortly. On Tue, 2008-03-25 at 10:06 -0500, Michael Wilde wrote: > One thing I'll test is generating the info file on /tmp, and moving it > when done to the final job dir. > > I can see adjusting wrapper.sh to go from very light to very logged with > a few increments in the middle that would be most useful. > > The main option I think we want to leave for users to toggle in common > usage, is whether to run the app with its jobdir on local disk, > typically below /tmp, or on shared disk. The user would decide based on > the job's I/O profile and on local disk space availability. > > Also, I recall some discussion on the success file. Thats acceptable > overhead for all but the tiniest of jobs, but when a BGP is eventually > running 100K+ short jobs at once, the rate of success file creation > could become a bottleneck. Seems like we could have an option that > avoids creating and expecting the success file if that proved useful - > need to measure. > > - Mike > > > On 3/25/08 9:32 AM, Mihael Hategan wrote: > > Problem may be that, as a quick test shows, bash opens and closes the > > info file every time a redirect is done. 
> > > > On Tue, 2008-03-25 at 08:44 -0500, Michael Wilde wrote: > >> I did runs the day before with a modified wrapper that bypassed the INFO > >> logging. It saved a good amount - I recall about 30% but need to > >> re-check the numbers. > >> > >> Yes, I came to the same conclusion on the mkdirs. Im looking at > >> reducing these, likely moving the jobdir to /tmp. I think I can do that > >> within the current structure. wrapper.sh is ver clear and nicely > >> written. (Ben: yes, eyeballing the log #s was easy and no problem). > >> > >> First thing I want to do, though, is run some large scale tests on our > >> two science workflows, increasing the petro-modelling one (the > >> sub-second application) to a larger runtime through app-level batching. > >> > >> Zhao's latest test indicate that if we do batches of 40, bringing the > >> jobs from .5 sec to 20 sec, we can saturate the BGP's 4K cores and keep > >> it running efficiently. Given the extra wrapper.sh overhead, I might > >> need to increase that another 10X, but once the app is wrapped in a > >> loop, it makes little difference to the user how big we make that. > >> > >> The other app is a molecule-docking app, that can be batched similarly. > >> > >> Once we get those running nicely at a larger, less brutal job time, I'll > >> come back to wrapper.sh tuning. If you or Ben want to do this in the > >> meantime, though, that would be great. We have the use-local-disk > >> scenario on our development stack anyways - this would be a good time to > >> do it. If I do it, it will be only a prototype for measurement purposes. > >> > >> Mike > >> > >> > >> > >> > >> On 3/25/08 8:34 AM, Mihael Hategan wrote: > >>> On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: > >>>> On 3/25/08 3:31 AM, Mihael Hategan wrote: > >>>>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: > >>>>>> I eyeballed the wrapperlogs to get a rough idea of what was happening. 
> >>>>>> > >>>>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 > >>>>>> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to > >>>>>> about 30+ seconds for a core app exec time of about 1 sec. (Im just > >>>>>> recollecting the times as at this point I didnt write much down). > >>>>>> > >>>>> I would personally like to see those logs. > >>>> I listed all the runs in the previous mail (below), Mihael. They are on > >>>> CI NFS at ~benc/swift-logs/wilde/run{345-350}. > >>> Sorry about that. > >>> > >>>> Let us know what you find. > >>>> > >>> It looks like this: > >>> - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: > >>> mkdir -p $WFDIR/info/$JOBDIR > >>> mkdir -p $WFDIR/status/$JOBDIR > >>> and the creation of the info file. > >>> - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: > >>> mkdir -p $DIR > >>> (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, > >>> which seems to roughly fit the observed numbers). > >>> - 3.5 seconds for COPYING_OUTPUTS > >>> - 2.5 seconds for RM_JOBDIR > >>> > >>> I'd be curious to know how much of the time is actually spent writing to > >>> the logs. That's because I see one second between EXECUTE_DONE and > >>> COPYING_OUTPUTS, a place where the only meaningful things that are done > >>> are two log messages. > >>> > >>> Perhaps it may be useful to run the whole thing through strace -T. > >>> > >>> Mihael > >>> > >>> > > > > > From wilde at mcs.anl.gov Tue Mar 25 10:16:21 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Mar 2008 10:16:21 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] 
In-Reply-To: <1206457746.20249.1.camel@blabla.mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> <47E901C8.1060106@mcs.anl.gov> <1206455574.31476.15.camel@blabla.mcs.anl.gov> <47E91506.2080100@mcs.anl.gov> <1206457746.20249.1.camel@blabla.mcs.anl.gov> Message-ID: <47E91745.7090909@mcs.anl.gov> Great, thanks Mihael. Thats a useful step. I'll test. - Mike On 3/25/08 10:09 AM, Mihael Hategan wrote: > I just wrote a version of the wrapper that opens the log in a descriptor > (so opening happens once). I need to test it first, but I'll commit > shortly. > > On Tue, 2008-03-25 at 10:06 -0500, Michael Wilde wrote: >> One thing I'll test is generating the info file on /tmp, and moving it >> when done to the final job dir. >> >> I can see adjusting wrapper.sh to go from very light to very logged with >> a few increments in the middle that would be most useful. >> >> The main option I think we want to leave for users to toggle in common >> usage, is whether to run the app with its jobdir on local disk, >> typically below /tmp, or on shared disk. The user would decide based on >> the job's I/O profile and on local disk space availability. >> >> Also, I recall some discussion on the success file. Thats acceptable >> overhead for all but the tiniest of jobs, but when a BGP is eventually >> running 100K+ short jobs at once, the rate of success file creation >> could become a bottleneck. Seems like we could have an option that >> avoids creating and expecting the success file if that proved useful - >> need to measure. 
>> >> - Mike >> >> >> On 3/25/08 9:32 AM, Mihael Hategan wrote: >>> Problem may be that, as a quick test shows, bash opens and closes the >>> info file every time a redirect is done. >>> >>> On Tue, 2008-03-25 at 08:44 -0500, Michael Wilde wrote: >>>> I did runs the day before with a modified wrapper that bypassed the INFO >>>> logging. It saved a good amount - I recall about 30% but need to >>>> re-check the numbers. >>>> >>>> Yes, I came to the same conclusion on the mkdirs. Im looking at >>>> reducing these, likely moving the jobdir to /tmp. I think I can do that >>>> within the current structure. wrapper.sh is ver clear and nicely >>>> written. (Ben: yes, eyeballing the log #s was easy and no problem). >>>> >>>> First thing I want to do, though, is run some large scale tests on our >>>> two science workflows, increasing the petro-modelling one (the >>>> sub-second application) to a larger runtime through app-level batching. >>>> >>>> Zhao's latest test indicate that if we do batches of 40, bringing the >>>> jobs from .5 sec to 20 sec, we can saturate the BGP's 4K cores and keep >>>> it running efficiently. Given the extra wrapper.sh overhead, I might >>>> need to increase that another 10X, but once the app is wrapped in a >>>> loop, it makes little difference to the user how big we make that. >>>> >>>> The other app is a molecule-docking app, that can be batched similarly. >>>> >>>> Once we get those running nicely at a larger, less brutal job time, I'll >>>> come back to wrapper.sh tuning. If you or Ben want to do this in the >>>> meantime, though, that would be great. We have the use-local-disk >>>> scenario on our development stack anyways - this would be a good time to >>>> do it. If I do it, it will be only a prototype for measurement purposes. 
>>>> >>>> Mike >>>> >>>> >>>> >>>> >>>> On 3/25/08 8:34 AM, Mihael Hategan wrote: >>>>> On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: >>>>>> On 3/25/08 3:31 AM, Mihael Hategan wrote: >>>>>>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: >>>>>>>> I eyeballed the wrapperlogs to get a rough idea of what was happening. >>>>>>>> >>>>>>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 >>>>>>>> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to >>>>>>>> about 30+ seconds for a core app exec time of about 1 sec. (Im just >>>>>>>> recollecting the times as at this point I didnt write much down). >>>>>>>> >>>>>>> I would personally like to see those logs. >>>>>> I listed all the runs in the previous mail (below), Mihael. They are on >>>>>> CI NFS at ~benc/swift-logs/wilde/run{345-350}. >>>>> Sorry about that. >>>>> >>>>>> Let us know what you find. >>>>>> >>>>> It looks like this: >>>>> - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: >>>>> mkdir -p $WFDIR/info/$JOBDIR >>>>> mkdir -p $WFDIR/status/$JOBDIR >>>>> and the creation of the info file. >>>>> - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: >>>>> mkdir -p $DIR >>>>> (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, >>>>> which seems to roughly fit the observed numbers). >>>>> - 3.5 seconds for COPYING_OUTPUTS >>>>> - 2.5 seconds for RM_JOBDIR >>>>> >>>>> I'd be curious to know how much of the time is actually spent writing to >>>>> the logs. That's because I see one second between EXECUTE_DONE and >>>>> COPYING_OUTPUTS, a place where the only meaningful things that are done >>>>> are two log messages. >>>>> >>>>> Perhaps it may be useful to run the whole thing through strace -T. 
>>>>> >>>>> Mihael >>>>> >>>>> >>> > > From hategan at mcs.anl.gov Tue Mar 25 10:52:27 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 10:52:27 -0500 Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E91745.7090909@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> <47E901C8.1060106@mcs.anl.gov> <1206455574.31476.15.camel@blabla.mcs.anl.gov> <47E91506.2080100@mcs.anl.gov> <1206457746.20249.1.camel@blabla.mcs.anl.gov> <47E91745.7090909@mcs.anl.gov> Message-ID: <1206460347.20974.8.camel@blabla.mcs.anl.gov> Done. On Tue, 2008-03-25 at 10:16 -0500, Michael Wilde wrote: > Great, thanks Mihael. Thats a useful step. I'll test. > > - Mike > > On 3/25/08 10:09 AM, Mihael Hategan wrote: > > I just wrote a version of the wrapper that opens the log in a descriptor > > (so opening happens once). I need to test it first, but I'll commit > > shortly. > > > > On Tue, 2008-03-25 at 10:06 -0500, Michael Wilde wrote: > >> One thing I'll test is generating the info file on /tmp, and moving it > >> when done to the final job dir. > >> > >> I can see adjusting wrapper.sh to go from very light to very logged with > >> a few increments in the middle that would be most useful. > >> > >> The main option I think we want to leave for users to toggle in common > >> usage, is whether to run the app with its jobdir on local disk, > >> typically below /tmp, or on shared disk. The user would decide based on > >> the job's I/O profile and on local disk space availability. > >> > >> Also, I recall some discussion on the success file. 
Thats acceptable > >> overhead for all but the tiniest of jobs, but when a BGP is eventually > >> running 100K+ short jobs at once, the rate of success file creation > >> could become a bottleneck. Seems like we could have an option that > >> avoids creating and expecting the success file if that proved useful - > >> need to measure. > >> > >> - Mike > >> > >> > >> On 3/25/08 9:32 AM, Mihael Hategan wrote: > >>> Problem may be that, as a quick test shows, bash opens and closes the > >>> info file every time a redirect is done. > >>> > >>> On Tue, 2008-03-25 at 08:44 -0500, Michael Wilde wrote: > >>>> I did runs the day before with a modified wrapper that bypassed the INFO > >>>> logging. It saved a good amount - I recall about 30% but need to > >>>> re-check the numbers. > >>>> > >>>> Yes, I came to the same conclusion on the mkdirs. Im looking at > >>>> reducing these, likely moving the jobdir to /tmp. I think I can do that > >>>> within the current structure. wrapper.sh is ver clear and nicely > >>>> written. (Ben: yes, eyeballing the log #s was easy and no problem). > >>>> > >>>> First thing I want to do, though, is run some large scale tests on our > >>>> two science workflows, increasing the petro-modelling one (the > >>>> sub-second application) to a larger runtime through app-level batching. > >>>> > >>>> Zhao's latest test indicate that if we do batches of 40, bringing the > >>>> jobs from .5 sec to 20 sec, we can saturate the BGP's 4K cores and keep > >>>> it running efficiently. Given the extra wrapper.sh overhead, I might > >>>> need to increase that another 10X, but once the app is wrapped in a > >>>> loop, it makes little difference to the user how big we make that. > >>>> > >>>> The other app is a molecule-docking app, that can be batched similarly. > >>>> > >>>> Once we get those running nicely at a larger, less brutal job time, I'll > >>>> come back to wrapper.sh tuning. 
If you or Ben want to do this in the > >>>> meantime, though, that would be great. We have the use-local-disk > >>>> scenario on our development stack anyways - this would be a good time to > >>>> do it. If I do it, it will be only a prototype for measurement purposes. > >>>> > >>>> Mike > >>>> > >>>> > >>>> > >>>> > >>>> On 3/25/08 8:34 AM, Mihael Hategan wrote: > >>>>> On Tue, 2008-03-25 at 08:16 -0500, Michael Wilde wrote: > >>>>>> On 3/25/08 3:31 AM, Mihael Hategan wrote: > >>>>>>> On Tue, 2008-03-25 at 00:28 -0500, Michael Wilde wrote: > >>>>>>>> I eyeballed the wrapperlogs to get a rough idea of what was happening. > >>>>>>>> > >>>>>>>> I ran with wrapperlog saving and no other changes for wf's of 10, 100 > >>>>>>>> and 500 jobs, to see how the exec time grew. At 500 jobs it grew to > >>>>>>>> about 30+ seconds for a core app exec time of about 1 sec. (Im just > >>>>>>>> recollecting the times as at this point I didnt write much down). > >>>>>>>> > >>>>>>> I would personally like to see those logs. > >>>>>> I listed all the runs in the previous mail (below), Mihael. They are on > >>>>>> CI NFS at ~benc/swift-logs/wilde/run{345-350}. > >>>>> Sorry about that. > >>>>> > >>>>>> Let us know what you find. > >>>>>> > >>>>> It looks like this: > >>>>> - 5 seconds between LOG_START and CREATE_JOBDIR. Likely hogs: > >>>>> mkdir -p $WFDIR/info/$JOBDIR > >>>>> mkdir -p $WFDIR/status/$JOBDIR > >>>>> and the creation of the info file. > >>>>> - 2.5 seconds between CREATE_JOBDIR and CREATE_INPUTDIR. Likely problem: > >>>>> mkdir -p $DIR > >>>>> (on a very fuzzy note, if one mkdir takes 2.5 seconds, two will take 5, > >>>>> which seems to roughly fit the observed numbers). > >>>>> - 3.5 seconds for COPYING_OUTPUTS > >>>>> - 2.5 seconds for RM_JOBDIR > >>>>> > >>>>> I'd be curious to know how much of the time is actually spent writing to > >>>>> the logs. 
That's because I see one second between EXECUTE_DONE and >>>>> COPYING_OUTPUTS, a place where the only meaningful things that are done >>>>> are two log messages. >>>>> >>>>> Perhaps it may be useful to run the whole thing through strace -T. >>>>> >>>>> Mihael >>>>> >>>>> >>> > > From wilde at mcs.anl.gov Tue Mar 25 11:36:22 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Tue, 25 Mar 2008 11:36:22 -0500 Subject: [Swift-devel] Re: How to wait on functions that return no data? In-Reply-To: <1206460871.20974.18.camel@blabla.mcs.anl.gov> References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> Message-ID: <47E92A06.4090705@mcs.anl.gov> Related to this virtual idea, is it possible to add language semantics where a function defined as returning an object can decide to return "null", in which case it's deemed to be complete but decided not to generate a result? So a foreach that calls 1000 functions could complete when 10 return files and 990 return null? I'm moving this discussion to swift-devel by the way, as it's now talking about future possibilities. From a pure language point of view, we should permit the return of data that can be grouped (batched) into files in arbitrary chunks, determined and optimized by the implementation. Map-reduce tuples seem to work well for this model, and it seems that Swift could encompass it with minimal semantic change to the current language. This petro model app seems to be a good illustration of the use case. The function, the way I'm calling it, is basically z = f(x,y) where x, y, z are floats. To treat it as tuples, the return would be (x,y,z) = f(x,y) - i.e. the return is a triple, so that the reduce step simply merges all the output tuples and plots them. 
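[Editorial note: the reduce step Mike describes - merging per-invocation (x, y, z) tuples for plotting - could, outside Swift, be as simple as concatenating per-job files. File names and layout here are hypothetical illustrations.]

```shell
# Each runam() invocation appends its "x y z" triples to a private file on
# local disk; the collect step concatenates and sorts them for the plot.
OUTDIR=${OUTDIR:-/tmp/tuples.$$}
mkdir -p "$OUTDIR"
printf '1 2 3.5\n' > "$OUTDIR/job1.tuples"   # stand-ins for real model output
printf '1 3 4.0\n' > "$OUTDIR/job2.tuples"

sort -n "$OUTDIR"/*.tuples > "$OUTDIR/merged.tuples"   # the "reduce" merge
```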
(example plot below) - Mike This sweep varied the Low-S Light LL and Med-S Light LL production yields for Diesel fuel and plotted the effect on the Discount Investment: It shows a sweep on $2 and $3 in this line of adj_crude.txt. > -3 Prod_Yields > ... > 3 Diesel $2 $3 $4 $5 $6 $7 The production yield is plotted in: http://www.ci.uchicago.edu/~wilde/psweep1.png On 3/25/08 11:01 AM, Mihael Hategan wrote: > I think there is some confusion here between language and > implementation. > > The language can express the problem just fine. That's why I'm saying > you should change doall() to return an array with all the outputs. > > It's the implementation that behaves in a very poor way if the > applications are very fine grained. You seem to be trying to solve the > problem by: > 1. Doing some magic with the way files are moved around > 2. Convincing Swift that it should work without knowing about data > dependencies, despite the fact that it only works properly if it knows > about all data dependencies. By definition. > > There is some middle ground here. It may be possible to let Swift know > what the data dependencies are, but also prevent it from dealing with > certain files, by marking them as "virtual" (or whatever the term). > > Mihael > > On Tue, 2008-03-25 at 10:45 -0500, Michael Wilde wrote: >> Your view has merits in terms of language purity, but I disagree with it. >> >> This was posed as an academic question, and I think its interesting to >> discuss. >> >> The point here is that there's an application that could best be done by >> batching up its output, and in fact perhaps by using the map-reduce >> representation of tuples for that output. >> >> Its still driven by dataflow and data dependencies, just not the >> simplistic lock-step dependencies that swift implements today. 
>> >> For example, one way to address the problem is to say that batching of >> function calls, the way swift does today, is helpful but ignores the >> problem that small tasks often have small data inputs and outputs, and >> that these should be batched along with the job execution. >> >> That would leave swift language semantics unchanged, but the >> implementation would get more efficient and could handle finer-grained >> tasks. >> >> An even more efficient and interesting approach, fully in keeping with >> the language as it stands today, would be to allow tuples to be >> expressed as inputs and outputs, and to have swift efficiently and >> automatically route (and batch) tuples in and out of jobs. >> >> So I view what I was asking for here as a prototype or exploration of >> that direction. It would be good to test the performance of an >> implementation that streamed output tuples into a subsequent ("reduce") >> stage of processing, before we even consider what the language and/or >> implementation would need to do for such a case. >> >> >> On 3/25/08 10:23 AM, Mihael Hategan wrote: >> ... >> > Don't use Swift then. Seriously. If you don't want to express things in >> > a dataflow oriented way, and are not satisfied with its performance for >> > the given problem, don't use it. >> >> I want to express things as dataflow, with high performance, in Swift. >> >> Mike >> >> >> On 3/25/08 10:23 AM, Mihael Hategan wrote: >>> On Tue, 2008-03-25 at 10:14 -0500, Michael Wilde wrote: >>>>>> In the example below, I want collectResults() to get invoked after all >>>> >> the runam() calls complete in doall(). >>>> > >>>> > results = doall(); >>>> > collectResults(results); >>>> > >>>> > Mihael >>>> >>>> But thats the problem: doall() does not in this example return results. >>> Then it should be fixed. >>> >>>> If it would return an artificial result, how would we get such a return >>>> to wait until all the runam() calls made within the freach() have completed? 
>>>> >>>> Each of the runam() call runs a small model, and in this proposed >>>> scenario would leave those results on a local disk for later collection, >>>> either in a single shared file that many invocations would append to, or >>>> in a set of files. >>> I don't think the solution to performance problems in Swift is to hack >>> stuff like that. >>> >>>> Then collectresults() would run a job that collects all the data when done. >>>> >>>> One approach can be to have collectresults() just run iteratively until >>>> it has collected a sufficient number of results. I.e., to have it not >>>> depend on swift to find out when all the runam() calls have completed. >>>> That might work. >>> Don't use Swift then. Seriously. If you don't want to express things in >>> a dataflow oriented way, and are not satisfied with its performance for >>> the given problem, don't use it. >>> >>> Mihael >>> >>>> - Mike >>>> >>>> >>>> On 3/25/08 10:00 AM, Mihael Hategan wrote: >>>>> On Tue, 2008-03-25 at 09:46 -0500, Michael Wilde wrote: >>>>>> For the petro-model app Im working on, it would be interesting to run >>>>>> the parameter sweep in "map reduce" manner, in which each invocation >>>>>> bites off a portion of the parameter space and processes it, resulting >>>>>> in a set of result tuples. Each run of the model will produce a set of >>>>>> tuples, and when all are done, we want to aggregate and plot the tuples. >>>>>> >>>>>> While with batching this is not strictly needed, it would be interesting >>>>>> to let the model results accumulate on the local filesystem (as in this >>>>>> case they are small) and collect them either at the end of the run, or >>>>>> periodically and perhaps asynchronously during the run. >>>>>> >>>>>> To do this, we'd want to write the model invocation as a swift function >>>>>> with only scalar numeric parameters, and no output. >>>>> That assertion I'm not sure about. 
>>>>> >>>>>> The question is how to call a zero-returns function in a swift foreach() >>>>>> loop, and embed that foreach() in a function that doesnt return until >>>>>> all members of the foreach() have been processed. >>>>> The very notion of "return" as it would appear in a strict language >>>>> doesn't make much sense in Swift, so I'm not quite sure. >>>>> >>>>>> I havent tried to code this yet, because I cant think of a way to >>>>>> express it in swift, due to the data-dependency semantics. >>>>>> >>>>>> In the example below, I want collectResults() to get invoked after all >>>>>> the runam() calls complete in doall(). >>>>> results = doall(); >>>>> collectResults(results); >>>>> >>>>> Mihael >>>>> >>>>>> Anyone have any ideas? >>>>>> >>>>>> This is a low-priority question, just food for thought, as the batched >>>>>> way of running this parameter sweep should be straightforward and efficient. >>>>>> >>>>>> Mike >>>>>> >>>>>> >>>>>> >>>>>> // Amiga-Mars Parameter Sweep >>>>>> >>>>>> type amout; >>>>>> >>>>>> runam (string id , string p1, string p2) // no ret val >>>>>> { >>>>>> app { runam3 id p1 p2 ; } >>>>>> } >>>>>> >>>>>> type params { >>>>>> string id; >>>>>> string p1; >>>>>> string p2; >>>>>> }; >>>>>> >>>>>> doall(params p[]) >>>>>> { >>>>>> foreach pset in p { >>>>>> runam(pset.id, pset.p1, pset.p2); >>>>>> } >>>>>> // waitTillAllDone(); >>>>>> // want to block here till all above finish, >>>>>> // but no data to wait on. any way to >>>>>> // achieve this??? >>>>>> } >>>>>> >>>>>> // Main >>>>>> >>>>>> params p[]; >>>>>> p = readdata("paramlist"); >>>>>> doall(p); >>>>>> amout amdata ; >>>>>> amdata = collectResults(); >>>>>> >>>>>> // ^^^ Want collectresults to run AFTER all runam() calls finish >>>>>> // in the doall() function. 
>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Swift-user mailing list >>>>>> Swift-user at ci.uchicago.edu >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user >>>>>> >>> > > From hategan at mcs.anl.gov Tue Mar 25 12:41:54 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 12:41:54 -0500 Subject: [Swift-devel] Re: How to wait on functions that return no data? In-Reply-To: <47E92A06.4090705@mcs.anl.gov> References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> Message-ID: <1206466914.24237.11.camel@blabla.mcs.anl.gov> On Tue, 2008-03-25 at 11:36 -0500, Michael Wilde wrote: > Related to this virtual idea, is it possible to add language semantics > where a function defined as returning an object can decide to return > "null", in which case its deemed to be complete but decided no to > generate a result? Yes. What I mentioned would be similar. > > So a foreach that calls 1000 functions could complete when 10 return > files and 990 return null? That we don't have. > > I'm moving this discussion to swift-devel by the way, as its now talking > about future possibilities. > > From a pure language point of view, we should permit the return of data > that can be grouped (batched) into files files in arbitrary chunks, > determined and optimized by the implementation. Map-reduce tuples seem > to work well for this model, and it seems that Swift could encompass it > with minimal semantic change to the current language. > > This petro model app seems to be a good illustration of the use case. Before the petro model, we had the Aphasia model which required pretty much the same thing. I.e. for some inputs there was no output. 
> The function the way Im calling it is basically z = f (x,y) where x,y,z > are floats. > > To treat it as tuples, the return would be (x,y,z) = f(x,y) - ie the > return is a triple, so that the reduce step simply merges all the output > tuples and plots them. (example plot below) That's a 2d array. t[x][y] = f(x, y); Or even a simple list of arrays (which we would simulate with an array). Overall, this does not deal with "missing" elements. That's where user exceptions, which we spoke of before, would come in: try { t[x][y] = f(x, y); } catch (MissingValue) { //discard } Mihael > > - Mike > > > This sweep varied the Low-S Light LL and Med-S Light LL production > yields for Diesel fuel and plotted the effect on the Discount Investment: > > It shows a sweep on $2 and $3 in this line of adj_crude.txt. > > > -3 Prod_Yields > > ... > > 3 Diesel $2 $3 $4 $5 $6 $7 > > The production yield is plotted in: > > http://www.ci.uchicago.edu/~wilde/psweep1.png > > > > On 3/25/08 11:01 AM, Mihael Hategan wrote: > > I think there is some confusion here between language and > > implementation. > > > > The language can express the problem just fine. That's why I'm saying > > you should change doall() to return an array with all the outputs. > > > > It's the implementation that behaves in a very poor way if the > > applications are very fine grained. You seem to be trying to solve the > > problem by: > > 1. Doing some magic with the way files are moved around > > 2. Convincing Swift that it should work without knowing about data > > dependencies, despite the fact that it only works properly if it knows > > about all data dependencies. By definition. > > > > There is some middle ground here. It may be possible to let Swift know > > what the data dependencies are, but also prevent it from dealing with > > certain files, by marking them as "virtual" (or whatever the term). 
> > > > Mihael > > > > On Tue, 2008-03-25 at 10:45 -0500, Michael Wilde wrote: > >> Your view has merits in terms of language purity, but I disagree with it. > >> > >> This was posed as an academic question, and I think its interesting to > >> discuss. > >> > >> The point here is that there's an application that could best be done by > >> batching up its output, and in fact perhaps by using the map-reduce > >> representation of tuples for that output. > >> > >> Its still driven by dataflow and data dependencies, just not the > >> simplistic lock-step dependencies that swift implements today. > >> > >> For example, one way to address the problem is to say that batching of > >> function calls, the way swift does today, is helpful but ignores the > >> problem that small tasks often have small data inputs and outputs, and > >> that these should be batched along with the job execution. > >> > >> That would leave swift language semantics unchanged, but the > >> implementation would get more efficient and could handle finer-grained > >> tasks. > >> > >> An even more efficient and interesting approach, fully in keeping with > >> the language as it stands today, would be to allow tuples to be > >> expressed as inputs and outputs, and to have swift efficiently and > >> automatically route (and batch) tuples in and out of jobs. > >> > >> So I view what I was asking for here as a prototype or exploration of > >> that direction. It would be good to test the performance of an > >> implementation that streamed output tuples into a subsequent ("reduce") > >> stage of processing, before we even consider what the language and/or > >> implementation would need to do for such a case. > >> > >> > >> On 3/25/08 10:23 AM, Mihael Hategan wrote: > >> ... > >> > Don't use Swift then. Seriously. If you don't want to express things in > >> > a dataflow oriented way, and are not satisfied with its performance for > >> > the given problem, don't use it. 
> >> > >> I want to express things as dataflow, with high performance, in Swift. > >> > >> Mike > >> > >> > >> On 3/25/08 10:23 AM, Mihael Hategan wrote: > >>> On Tue, 2008-03-25 at 10:14 -0500, Michael Wilde wrote: > >>>>>> In the example below, I want collectResults() to get invoked after all > >>>> >> the runam() calls complete in doall(). > >>>> > > >>>> > results = doall(); > >>>> > collectResults(results); > >>>> > > >>>> > Mihael > >>>> > >>>> But thats the problem: doall() does not in this example return results. > >>> Then it should be fixed. > >>> > >>>> If it would return an artificial result, how would we get such a return > >>>> to wait until all the runam() calls made within the freach() have completed? > >>>> > >>>> Each of the runam() call runs a small model, and in this proposed > >>>> scenario would leave those results on a local disk for later collection, > >>>> either in a single shared file that many invocations would append to, or > >>>> in a set of files. > >>> I don't think the solution to performance problems in Swift is to hack > >>> stuff like that. > >>> > >>>> Then collectresults() would run a job that collects all the data when done. > >>>> > >>>> One approach can be to have collectresults() just run iteratively until > >>>> it has collected a sufficient number of results. I.e., to have it not > >>>> depend on swift to find out when all the runam() calls have completed. > >>>> That might work. > >>> Don't use Swift then. Seriously. If you don't want to express things in > >>> a dataflow oriented way, and are not satisfied with its performance for > >>> the given problem, don't use it. 
> >>> > >>> Mihael > >>> > >>>> - Mike > >>>> > >>>> > >>>> On 3/25/08 10:00 AM, Mihael Hategan wrote: > >>>>> On Tue, 2008-03-25 at 09:46 -0500, Michael Wilde wrote: > >>>>>> For the petro-model app Im working on, it would be interesting to run > >>>>>> the parameter sweep in "map reduce" manner, in which each invocation > >>>>>> bites off a portion of the parameter space and processes it, resulting > >>>>>> in a set of result tuples. Each run of the model will produce a set of > >>>>>> tuples, and when all are done, we want to aggregate and plot the tuples. > >>>>>> > >>>>>> While with batching this is not strictly needed, it would be interesting > >>>>>> to let the model results accumulate on the local filesystem (as in this > >>>>>> case they are small) and collect them either at the end of the run, or > >>>>>> periodically and perhaps asynchronously during the run. > >>>>>> > >>>>>> To do this, we'd want to write the model invocation as a swift function > >>>>>> with only scalar numeric parameters, and no output. > >>>>> That assertion I'm not sure about. > >>>>> > >>>>>> The question is how to call a zero-returns function in a swift foreach() > >>>>>> loop, and embed that foreach() in a function that doesnt return until > >>>>>> all members of the foreach() have been processed. > >>>>> The very notion of "return" as it would appear in a strict language > >>>>> doesn't make much sense in Swift, so I'm not quite sure. > >>>>> > >>>>>> I havent tried to code this yet, because I cant think of a way to > >>>>>> express it in swift, due to the data-dependency semantics. > >>>>>> > >>>>>> In the example below, I want collectResults() to get invoked after all > >>>>>> the runam() calls complete in doall(). > >>>>> results = doall(); > >>>>> collectResults(results); > >>>>> > >>>>> Mihael > >>>>> > >>>>>> Anyone have any ideas? 
> >>>>>> > >>>>>> This is a low-priority question, just food for thought, as the batched > >>>>>> way of running this parameter sweep should be straightforward and efficient. > >>>>>> > >>>>>> Mike > >>>>>> > >>>>>> > >>>>>> > >>>>>> // Amiga-Mars Parameter Sweep > >>>>>> > >>>>>> type amout; > >>>>>> > >>>>>> runam (string id , string p1, string p2) // no ret val > >>>>>> { > >>>>>> app { runam3 id p1 p2 ; } > >>>>>> } > >>>>>> > >>>>>> type params { > >>>>>> string id; > >>>>>> string p1; > >>>>>> string p2; > >>>>>> }; > >>>>>> > >>>>>> doall(params p[]) > >>>>>> { > >>>>>> foreach pset in p { > >>>>>> runam(pset.id, pset.p1, pset.p2); > >>>>>> } > >>>>>> // waitTillAllDone(); > >>>>>> // want to block here till all above finish, > >>>>>> // but no data to wait on. any way to > >>>>>> // achieve this??? > >>>>>> } > >>>>>> > >>>>>> // Main > >>>>>> > >>>>>> params p[]; > >>>>>> p = readdata("paramlist"); > >>>>>> doall(p); > >>>>>> amout amdata ; > >>>>>> amdata = collectResults(); > >>>>>> > >>>>>> // ^^^ Want collectresults to run AFTER all runam() calls finish > >>>>>> // in the doall() function. > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Swift-user mailing list > >>>>>> Swift-user at ci.uchicago.edu > >>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user > >>>>>> > >>> > > > > > From iraicu at cs.uchicago.edu Tue Mar 25 15:31:58 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Tue, 25 Mar 2008 15:31:58 -0500 Subject: [Swift-devel] new bugzilla usage for Falkon Message-ID: <47E9613E.40906@cs.uchicago.edu> Hi all, I just started using Bugzilla to keep track of bugs, problems, and new features for Falkon. You can use the following two links for creating new bugs and displaying open bugs. 
* Create new bug: http://bugzilla.globus.org/globus/enter_bug.cgi?product=Falkon * Display open bugs: http://bugzilla.globus.org/globus/buglist.cgi?query_format=specific&order=relevance+desc&bug_status=__open__&product=Falkon&content= As you encounter problems with Falkon, or want new features added, please feel free to use the add new bug form to help keep everything organized. Cheers, Ioan -- =================================================== Ioan Raicu Ph.D. Candidate =================================================== Distributed Systems Laboratory Computer Science Department University of Chicago 1100 E. 58th Street, Ryerson Hall Chicago, IL 60637 =================================================== Email: iraicu at cs.uchicago.edu Web: http://www.cs.uchicago.edu/~iraicu http://dev.globus.org/wiki/Incubator/Falkon http://dsl-wiki.cs.uchicago.edu/index.php/Main_Page =================================================== =================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From benc at hawaga.org.uk Tue Mar 25 17:49:07 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 25 Mar 2008 22:49:07 +0000 (GMT) Subject: [Swift-devel] Re: How to wait on functions that return no data? In-Reply-To: <47E92A06.4090705@mcs.anl.gov> References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> Message-ID: On Tue, 25 Mar 2008, Michael Wilde wrote: > Related to this virtual idea, is it possible to add language semantics where a > function defined as returning an object can decide to return "null", in which > case its deemed to be complete but decided no to generate a result? 
Going the Haskell way, introducing a Maybe type would be that - it's a dataflow rather than control-flow form of exception handling. You declare a type as 'maybe resultfile' and values of that type can be either 'Nothing' or a result file. You could have an array of: (Maybe resultfile)[] where each element is of type 'maybe resultfile' and so can (independent of the other elements) be a file or null. -- From benc at hawaga.org.uk Tue Mar 25 18:04:48 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 25 Mar 2008 23:04:48 +0000 (GMT) Subject: [Swift-devel] Re: How to wait on functions that return no data? In-Reply-To: <47E92A06.4090705@mcs.anl.gov> References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> Message-ID: On Tue, 25 Mar 2008, Michael Wilde wrote: > From a pure language point of view, we should permit the return of data that > can be grouped (batched) into files in arbitrary chunks, determined and > optimized by the implementation. Map-reduce tuples seem to work well for this > model, and it seems that Swift could encompass it with minimal semantic change > to the current language. For your example, what way do you want to store the data on the remote side - I'm assuming not individual files. The present dataset model should fairly easily accommodate the description of places to store data that aren't files - there's an abstraction in the implementation to help with that at the moment (DSHandle, which is what deals with the difference between in-memory values and on-disk files; and could fairly straightforwardly deal with other storage forms). One of the project ideas I put in for the Google Summer of Code was to play around with this, in fact.
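[Editorial sketch: the 'array of Maybe resultfile' idea above, rendered in Python rather than Swift syntax. run_model() and the file names are illustrative assumptions, not anything from the Swift implementation.]

```python
from typing import List, Optional

# Each array element is independently either a result or 'Nothing'
# (None in Python), and Nothing is a legitimate value, not an error.

def run_model(param: int) -> Optional[str]:
    """Hypothetical model run: returns a result file name, or None
    ('Nothing') when the job completes but produces no result."""
    if param % 2 == 0:
        return "result-%d.dat" % param
    return None

# An array of (Maybe resultfile): some elements are files, some Nothing.
results: List[Optional[str]] = [run_model(p) for p in range(4)]

# A downstream stage can wait on the whole array and skip the Nothings.
collected = [r for r in results if r is not None]
```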
-- From hategan at mcs.anl.gov Tue Mar 25 18:09:44 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 25 Mar 2008 18:09:44 -0500 Subject: [Swift-devel] Re: How to wait on functions that return no data? In-Reply-To: References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> Message-ID: <1206486584.11261.0.camel@blabla.mcs.anl.gov> On Tue, 2008-03-25 at 22:49 +0000, Ben Clifford wrote: > On Tue, 25 Mar 2008, Michael Wilde wrote: > > > Related to this virtual idea, is it possible to add language semantics where a > > function defined as returning an object can decide to return "null", in which > > case its deemed to be complete but decided no to generate a result? > > Going to haskell way, introducing a Maybe type would be that - its a > dataflow rather than control flow form of exception handling. You declare > a type as 'maybe resultfile' and values of that type can be either > 'Nothing' or a result file. > > You could have an array of: > > (Maybe resultfile)[] > > where each element is of type 'maybe resultfile' and so can (independent > of the other elements) be a file or null. pretty much like try {a[x] = f(x)} catch {} or maybe(a[x] = f(x)). > From benc at hawaga.org.uk Tue Mar 25 18:18:06 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Tue, 25 Mar 2008 23:18:06 +0000 (GMT) Subject: [Swift-devel] Re: How to wait on functions that return no data? 
In-Reply-To: <1206486584.11261.0.camel@blabla.mcs.anl.gov> References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> <1206486584.11261.0.camel@blabla.mcs.anl.gov> Message-ID: On Tue, 25 Mar 2008, Mihael Hategan wrote: > > where each element is of type 'maybe resultfile' and so can (independent > > of the other elements) be a file or null. > > pretty much like try {a[x] = f(x)} catch {} or maybe(a[x] = f(x)). In the array case, sort of, yes. It doesn't compare so well when passing round a single non-array value, though - a = Nothing is different from a not being assigned (yet, or never), which is what syntax like this: try { a = f(x) } catch {} alludes to. The try/catch syntax also alludes to a null response being somehow exceptional, rather than a legitimate return value. -- From benc at hawaga.org.uk Tue Mar 25 22:54:44 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 26 Mar 2008 03:54:44 +0000 (GMT) Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E88D8C.4090207@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> Message-ID: On Tue, 25 Mar 2008, Michael Wilde wrote: > also will leave the (tiny) job output on /tmp for later aggregation > (will have some swift questions on that). Note that Swift doesn't have any concept of non-shared filesystem management at the moment - if you want to keep files on a worker-local file system that is not accessible to the entire site, Swift doesn't have any way of getting a job to run somewhere where it can access that same filesystem.
We've talked about worker-local storage management before and agreed that it was both a long and hard thing to make happen. So it's interesting to see your experiences in this field, but it's probably very out of scope for any imminent development work. -- From benc at hawaga.org.uk Tue Mar 25 23:00:42 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Wed, 26 Mar 2008 04:00:42 +0000 (GMT) Subject: [Fwd: Re: [Swift-devel] Re: swift-falkon problem... plots to explain plateaus...] In-Reply-To: <47E91506.2080100@mcs.anl.gov> References: <47E7DC3D.6040704@cs.uchicago.edu> <47E7E30A.2020208@mcs.anl.gov> <47E7E684.5070303@cs.uchicago.edu> <47E7EC03.1090609@mcs.anl.gov> <1206381792.11561.1.camel@blabla.mcs.anl.gov> <47E86E5A.6080303@mcs.anl.gov> <47E88D8C.4090207@mcs.anl.gov> <1206433879.26701.0.camel@blabla.mcs.anl.gov> <47E8FB17.1080501@mcs.anl.gov> <1206452083.31476.12.camel@blabla.mcs.anl.gov> <47E901C8.1060106@mcs.anl.gov> <1206455574.31476.15.camel@blabla.mcs.anl.gov> <47E91506.2080100@mcs.anl.gov> Message-ID: On Tue, 25 Mar 2008, Michael Wilde wrote: > > Also, I recall some discussion on the success file. Thats acceptable overhead > > for all but the tiniest of jobs, but when a BGP is eventually running 100K+ > > short jobs at once, the rate of success file creation could become a I think there's an important Swift scoping issue here. Running 100k x 1s jobs is outside the scope of what I expect Swift to be used for any time soon; so I'm leery of spending time optimising out-of-scope applications at the expense of other work.
In-Reply-To: References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> <1206486584.11261.0.camel@blabla.mcs.anl.gov> Message-ID: <1206525050.11566.3.camel@blabla.mcs.anl.gov> On Tue, 2008-03-25 at 23:18 +0000, Ben Clifford wrote: > On Tue, 25 Mar 2008, Mihael Hategan wrote: > > > > where each element is of type 'maybe resultfile' and so can (independent > > > of the other elements) be a file or null. > > > > pretty much like try {a[x] = f(x)} catch {} or maybe(a[x] = f(x)). > > in the array case, sort of, yes. > > Doesn't compare so well when passing round a single non-array value though > - a = Nothing is different from a not being assigned (yet, or never), > which is what syntax this like: try { a =f(x) } catch {} alludes to. Right. Though one could say catch() {a = Nothing}. > > The try/catch syntax also alludes to a null response being somehow > exceptional, rather than a legitimate return value. > Not the null response, but actually f throwing an exception. From wilde at mcs.anl.gov Wed Mar 26 10:28:20 2008 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 26 Mar 2008 10:28:20 -0500 Subject: [Swift-devel] Re: How to wait on functions that return no data? In-Reply-To: References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> Message-ID: <47EA6B94.9090609@mcs.anl.gov> Sorry - a long response follows to your simple question: > For your example, what way do you want to store the data on the remote > side - I'm assuming not individual files. 
In this example, a C program takes in 5 data files describing parameters of the petroleum refining process, and models various economic, emission and production yields. We do parameter sweeps by varying a few of these input vars and plotting their effect on an output var. The 5 files are text files, bundled into the application wrapper as shell "here documents" using cat <datafileN. The parameters are inserted into these data files using simple shell var substitution. In the simple tests I'm running now, I vary 2 input vars, and plot one output var. Each run of the model, which takes about 1 sec, takes 3 parameters (id, x, y) from a readdata() file, and puts out a similar line with a 4th column, the z value (id, x, y, z). Id is an int; x, y, z are floats. In the simplest runs, I just run one model per swift job. So id, x and y are provided on the command line, and a single file is produced with the tuple (id, x, y, z). I am now testing a batched version, where the app-wrapper script takes a range of x and y values with increments, and iterates over that range at the specified increments. Each batch results in a single file with all the output tuples for that batch. For this case, this is fine, and is the end of the problem. But I asked about the null values to explore a different approach: where most batches run and just leave their outputs on a local filesystem, concatenated into one file. The nice thing about having output in tuples is that you can batch them in any arbitrary way, and the reduce step can sort and select as needed. I suspect you're not going to like this idea on first consideration. But it's related to ideas on how to leverage map-reduce, as I mentioned earlier, and Ian's suggestion to explore collective operations. Mihael thought my take on this was inelegant and inconsistent with data flow. I think it can be massaged to fit nicely in the model and provide useful capabilities.
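[Editorial sketch of the batched sweep just described: each batch covers a sub-range of (x, y) values and emits one set of (id, x, y, z) tuples, which a reduce step can concatenate in any order. model(), the batch bounds, and the id format are illustrative assumptions, not the real petro-model.]

```python
# Each batch iterates over its (x, y) sub-range and returns one
# "file" worth of (id, x, y, z) tuples.

def model(x, y):
    return x + y  # stand-in for the ~1 sec petro-model run

def run_batch(batch_id, xs, ys):
    tuples = []
    n = 0
    for x in xs:
        for y in ys:
            tuples.append(("%s-%d" % (batch_id, n), x, y, model(x, y)))
            n += 1
    return tuples  # one batch output, batched in an arbitrary chunk

# The reduce step just concatenates batch outputs and sorts/selects.
all_tuples = run_batch("b0", [0.0, 1.0], [0.0, 1.0]) \
           + run_batch("b1", [2.0, 3.0], [0.0, 1.0])
```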
Here's one way I thought it could work with the addition of null/Nothing to Swift. The idea was that most or all invocations of the model jobs would return Nothing, and the actual results would be collected later in large, efficient batches. If an invocation of a wrapper batch returns null, then a later job can go and interrogate the workers to collect the data. One possibility in the Falkon case was that one job would be broadcast to all workers, and collect all files of a desired type. Another approach is that each job ensures that there's a background task running on the worker, which waits for either some accumulation of data or elapsed time, and then transfers what was produced, as a single file. These files would be returned either as results of arbitrary actual model runs, or by a collector job that runs after all the models are complete. But, separate from this data collection operation, a Nothing return has a more direct use. It's handy in cases when you have a large set of short jobs, exploring some parameter space in which results are very sparse. In these cases, it would be nice to have a way to say that a job succeeded but returned null/Nothing. That reduces the need to pass back a large number of files that signify "Nothing" in some inefficient manner. It's also handy for executing jobs that have side effects, and still waiting for them to complete. This gets us to a related issue: If a swift job could efficiently return a set of swift objects without using a file (specifically without placing files back in the shared directory) then many of these apps could work beautifully, by returning strings or numeric objects, possibly as structs and/or arrays, that travel back through the job submission interface rather than getting fetched via the data provider.
If a cluster of jobs could return data efficiently in a single "package" from the cluster, then we could pretty readily do map-reduce in swift, efficiently, in perfect concordance with the current dataflow model. Perhaps this latter approach is the best to consider: I suspect it could be readily implemented, could use a simple file to contain an arbitrary set of swift object return values, possibly in a format similar to that of readdata(). - Mike On 3/25/08 6:04 PM, Ben Clifford wrote: > On Tue, 25 Mar 2008, Michael Wilde wrote: > >> From a pure language point of view, we should permit the return of data that >> can be grouped (batched) into files in arbitrary chunks, determined and >> optimized by the implementation. Map-reduce tuples seem to work well for this >> model, and it seems that Swift could encompass it with minimal semantic change >> to the current language. > > For your example, what way do you want to store the data on the remote > side - I'm assuming not individual files. > > The present dataset model should fairly easily accommodate the description > of places to store data that aren't files - there's an abstraction in the > implementation to help with that at the moment (DSHandle, which is what > deals with the difference between in-memory values and on-disk files; and > could fairly straightforwardly deal with other storage forms). > > One of the project ideas I put in for the Google Summer of Code was to > play around with this, in fact. > From hategan at mcs.anl.gov Wed Mar 26 10:51:33 2008 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 26 Mar 2008 10:51:33 -0500 Subject: [Swift-devel] Re: How to wait on functions that return no data?
In-Reply-To: <47EA6B94.9090609@mcs.anl.gov> References: <47E91036.9070806@mcs.anl.gov> <1206457254.19756.5.camel@blabla.mcs.anl.gov> <47E916E0.2020103@mcs.anl.gov> <1206458624.20974.7.camel@blabla.mcs.anl.gov> <47E91E33.3020700@mcs.anl.gov> <1206460871.20974.18.camel@blabla.mcs.anl.gov> <47E92A06.4090705@mcs.anl.gov> <47EA6B94.9090609@mcs.anl.gov> Message-ID: <1206546694.1119.16.camel@blabla.mcs.anl.gov> > I suspect you're not going to like this idea on first consideration. But > its related to ideas on how to leverage map-reduce, as I mentioned > earlier, and Ian's suggestion to explore collective operations. Mihael > thought my take on this was inelegant and inconsistent with data flow. Somewhat. What I thought you suggested was pretty much "I don't want to write my program as dataflow but I want to implement it in a dataflow language". "And if it doesn't work, then the language should be changed so that I can". [...] > > If a swift job could efficiently return a set of swift objects without > using a file In the context of Globus, it seems a bit difficult. > (specifically without placing files back in the shared > directory) then many of these apps could work beautifully, by returning > strings or numeric objects, possibly as structs and/r arrays, that > travel back through the job submission interface rather than getting > fetched via the data provider. If a cluster of jobs could return data > efficiently in a single "package" from the cluster, then we could pretty > readily do map-reduce in swift, efficiently, in perfect concordance with > the current dataflow model. One more time: we CAN do map-reduce in Swift. Stop saying we can't. Please. It's getting silly. The efficiency issue comes from the fact that the overhead for distributing very very very small tasks across a wide area network is very high compared to the task run time. And in the current Swift implementation it is higher than in the implementation you seem to think of. 
> > Perhaps this later approach is the best to consider: I suspect it could > be readily implemented, could use a simple file to contain an arbitrary > set of swift object return values, possibly in a format similar to that > of readdata(). How is this different from the current scheme (besides the data files being in a different format)? > > - Mike > > > > > > > > On 3/25/08 6:04 PM, Ben Clifford wrote: > > On Tue, 25 Mar 2008, Michael Wilde wrote: > > > >> From a pure language point of view, we should permit the return of data that > >> can be grouped (batched) into files files in arbitrary chunks, determined and > >> optimized by the implementation. Map-reduce tuples seem to work well for this > >> model, and it seems that Swift could encompass it with minimal semantic change > >> to the current language. > > > > For your example, what way do you want to store the data on the remote > > side - I'm assuming not individual files. > > > > The present dataset model should fairly easily accomodate the description > > of places to store data that aren't files - there's an abstraction in the > > implementation to help with that at the moment (DSHandle, which is what > > deals with the difference between in-memory values and on-disk files; and > > could fairly straightforwardly deal with other storage forms). > > > > One of the project ideas I put in for the google summer of code was to > > play around with this, in fact. > > > From liming at mcs.anl.gov Wed Mar 26 15:15:53 2008 From: liming at mcs.anl.gov (Lee Liming) Date: Wed, 26 Mar 2008 14:15:53 -0600 Subject: [Swift-devel] Swift at GlobusWorld Message-ID: <82AE556D-B408-46FB-83BF-152B5D0CE8D0@mcs.anl.gov> Hello, members of the Swift incubator project! I am writing to encourage you to propose a presentation on your work for the GlobusWorld track of the Open Source Grid and Cluster Conference in May. (See www.globus.org for details.) 
The official deadline for presentation proposals has already passed, but members of dev.globus projects and incubators are still encouraged to submit proposals as described on the conference website. This would be a good opportunity to let people know what you are doing in your incubator and how they could make use of it. Hope to see you at the conference, -- Lee From foster at mcs.anl.gov Wed Mar 26 15:25:08 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Wed, 26 Mar 2008 15:25:08 -0500 Subject: [Swift-devel] An interesting article on reproducibility Message-ID: <47EAB124.2040708@mcs.anl.gov> http://www.bepress.com/cgi/viewcontent.cgi?article=1002&context=bioconductor From foster at mcs.anl.gov Wed Mar 26 16:00:39 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Wed, 26 Mar 2008 16:00:39 -0500 Subject: [Swift-devel] Article on Hadoop, etc. Message-ID: <47EAB977.2060808@mcs.anl.gov> http://www.theregister.co.uk/2008/03/26/yahoo_hadoop_summit/ and Dryad: http://research.microsoft.com/research/sv/dryad/ From benc at hawaga.org.uk Thu Mar 27 23:15:04 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Fri, 28 Mar 2008 04:15:04 +0000 (GMT) Subject: [Swift-devel] proxy expiration whilst jobs are running through GRAM4 Message-ID: If the user proxy expires whilst a job is running from Swift through GRAM4, that job will hang in the Swift runtime. This is reproducible by running a 5-minute sleep job with a 2-minute proxy. I think (though I haven't looked at GRAM server-side logs to check) what is happening here is that status notifications cannot be delivered because of the expired credential, and the job then sits forever waiting for the notification that will never come. If so, then it would probably be better to refresh the credential if possible, and fail the job if we know that we cannot get notifications because the local proxy has expired.
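[Editorial sketch of the policy suggested above, with hypothetical helper names - the real fix would live in the GRAM4 provider: compare remaining credential lifetime against the expected job duration, refresh when a refresh hook is available, and otherwise fail fast instead of hanging on a notification that can never be delivered.]

```python
import time

def check_credential(expires_at, job_walltime, refresh=None, now=None):
    """Decide what to do before relying on notifications that need a
    valid credential. All names here are illustrative, not Swift API."""
    now = time.time() if now is None else now
    remaining = expires_at - now
    if remaining > job_walltime:
        return "ok"                # proxy outlives the job
    if refresh is not None:
        refresh()                  # e.g. re-delegate a fresh proxy
        return "refreshed"
    return "fail-fast"             # fail the job now rather than hang

# The reported case: a 2-minute proxy and a 5-minute sleep job,
# with no refresh available.
status = check_credential(expires_at=1000 + 120, job_walltime=300, now=1000)
```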
-- From callmeno2 at gmail.com Sun Mar 30 10:32:45 2008 From: callmeno2 at gmail.com (karthik parvathaneni) Date: Sun, 30 Mar 2008 21:02:45 +0530 Subject: [Swift-devel] joining mailing list !! Message-ID: <347f9e8c0803300832r6da1c887lefe20e4f0e2d58fb@mail.gmail.com> HI .. I AM A STUDENT INTERESTING IN FOLLOWING UP WITH THE WORK THATS IN PROGRESS ! .. AND THE PROJECT ITSELF IS CHALLLENGING ... I WOULD LOVE TAKE IT UP ... regards, karthik -------------- next part -------------- An HTML attachment was scrubbed... URL: From bugzilla-daemon at mcs.anl.gov Mon Mar 31 05:22:26 2008 From: bugzilla-daemon at mcs.anl.gov (bugzilla-daemon at mcs.anl.gov) Date: Mon, 31 Mar 2008 05:22:26 -0500 (CDT) Subject: [Swift-devel] [Bug 127] New: proxy expiration whilst jobs are running through GRAM4 Message-ID: http://bugzilla.mcs.anl.gov/swift/show_bug.cgi?id=127 Summary: proxy expiration whilst jobs are running through GRAM4 Product: Swift Version: unspecified Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: General AssignedTo: nobody at mcs.anl.gov ReportedBy: benc at hawaga.org.uk CC: benc at hawaga.org.uk, swift-devel at ci.uchicago.edu If the user proxy expires whilst a job is running from Swift through GRAM4, that job will hang in the swift runtime. This is reproducable by running a 5 minutes sleep job with a 2 minute proxy. I think (though I haven't looked at gram server side logs to check) what is happening here is that status notifications cannot be delivered because of the expired credential; and the job then sits forever waiting for the notification that will never come. If so, then probably it would be better to refresh the credential if possible; and fail the job if we know that we cannot get notifications because the local proxy has expired. -- Configure bugmail: http://bugzilla.mcs.anl.gov/swift/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is. 
From benc at hawaga.org.uk Mon Mar 31 05:23:40 2008 From: benc at hawaga.org.uk (Ben Clifford) Date: Mon, 31 Mar 2008 10:23:40 +0000 (GMT) Subject: [Swift-devel] Re: proxy expiration whilst jobs are running through GRAM4 In-Reply-To: References: Message-ID: for tracking, I put this in bugzilla as bug 127. -- From foster at mcs.anl.gov Mon Mar 31 08:11:53 2008 From: foster at mcs.anl.gov (Ian Foster) Date: Mon, 31 Mar 2008 08:11:53 -0500 Subject: [Swift-devel] interesting program at NeSC Message-ID: <47F0E319.2020005@mcs.anl.gov> http://wiki.esi.ac.uk/Principles_of_Provenance From iraicu at cs.uchicago.edu Mon Mar 31 10:46:17 2008 From: iraicu at cs.uchicago.edu (Ioan Raicu) Date: Mon, 31 Mar 2008 10:46:17 -0500 Subject: [Swift-devel] [Fwd: [Dbworld] Final Call for Papers for IEEE T-ASE Special Issue on Scientific Workflow Management and Application] Message-ID: <47F10749.4040405@cs.uchicago.edu> Here is a good journal which has a CFP specific to scientific workflow systems! The paper submission deadline is April 30th. Ioan -------- Original Message -------- Subject: [Dbworld] Final Call for Papers for IEEE T-ASE Special Issue on Scientific Workflow Management and Application Date: Mon, 31 Mar 2008 06:23:38 -0500 From: Jinjun Chen Reply-To: dbworld_owner at yahoo.com To: undisclosed-recipients: ; Final Call for Papers - IEEE T-ASE Special Issue on Scientific Workflow Management and Applications http://www.swinflow.org/si/t-ase.htm. Deadline for submission has been extended to April 30 2008 due to many requests. Details can be referred to http://www.swinflow.org/si/t-ase.htm. Best Wishes, Jinjun _______________________________________________ Please do not post msgs that are not relevant to the database community at large. Go to www.cs.wisc.edu/dbworld for guidelines and posting forms. To unsubscribe, go to https://lists.cs.wisc.edu/mailman/listinfo/dbworld -- =================================================== Ioan Raicu Ph.D. 
Candidate