[Swift-devel] feature request

Michael Wilde wilde at mcs.anl.gov
Fri Apr 24 15:50:48 CDT 2009



On 4/24/09 3:08 PM, Mihael Hategan wrote:
> On Fri, 2009-04-24 at 14:30 -0500, Michael Wilde wrote:
>> Great, Zhao.
>>
>> What's next?
>>
>> Testing of the new condor provider features is important.
> 
> The block allocator and coasters+condor-g are mutually useless.

I need to think that through. I see this set of values:

- The plain condor-g provider is high-value to OSG and TG immediately. 
It overcomes the GRAM2 overhead, and seems like it should open up many 
doors.  That's what I was asking Zhao to test.

- coasters+condor-g (unless you state otherwise) looks like it needs a 
sanity test by you first; maybe some adjustments to work?

That combination has the same high value.

- the block allocator has value in addition to the above for systems and 
cases where:
   a) there are limits on #jobs per user that can be queued or run at 
the same time
   b) it's desirable for scheduling reasons to do the allocation as one 
big job
   c) it's just plain more efficient to allocate in chunks

For this block allocator, I never envisioned (or wanted) something 
fancy: just a parameter hostsPerJob akin to coastersPerCore that would 
set the allocation unit, permitting users to set this to 1 (default 
perhaps), big (all on one job) or some modest number to get CPUs in chunks.
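
To make the hostsPerJob idea concrete, here is a rough sketch in Python 
(not Swift or coaster code; the function name plan_allocation and its 
arguments are made up for illustration, and the real allocator may well 
behave differently). It only shows how an allocation unit would split a 
request for N hosts into jobs:

```python
def plan_allocation(hosts_needed: int, hosts_per_job: int = 1) -> list[int]:
    """Split a request for hosts into jobs of at most hosts_per_job hosts.

    Hypothetical illustration only: returns the number of hosts each
    submitted job would ask for.
    """
    if hosts_needed < 0:
        raise ValueError("hosts_needed must be non-negative")
    if hosts_per_job < 1:
        raise ValueError("hosts_per_job must be at least 1")

    # Full-sized chunks first, then one smaller job for any remainder.
    full_jobs, remainder = divmod(hosts_needed, hosts_per_job)
    plan = [hosts_per_job] * full_jobs
    if remainder:
        plan.append(remainder)
    return plan


if __name__ == "__main__":
    print(plan_allocation(10, 1))   # one host per job
    print(plan_allocation(10, 10))  # everything in one job
    print(plan_allocation(10, 4))   # modest chunks
```

With the unit set to 1 this degenerates to one job per host, with the 
unit set to the full request it becomes a single big job, and anything 
in between gives CPUs in chunks.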

It seems to me that there is value for all the combinations, but it 
certainly merits discussion to pick the smallest number of features and 
interactions we can, to keep development, testing, and above all, usage, 
as simple as possible. But no simpler.

- Mike

> 
> So if we were otherwise planning to stabilize both at the same time,
> then it may probably be better to focus on just one.
> 
>> For failures, it would be good to see if you can go further in at least 
>> lifting out the errors so that developers can either tell you if there 
>> was an error in the testing (which you should fix) or an error in the 
>> code (which they should fix, and you can help get them the info they 
>> need, and identify faster which errors might need more immediate attention).
>>
>> Can you suggest a methodical approach to testing, in terms of:
>>
>> - what tests you need to and plan to run on what systems?
>> - how the reports are organized
>> - how errors are listed and diagnosed
>>
>> I want this to be a more interactive process between you and the 
>> developers, not just "it broke, see dir X"
>>
>> Thanks,
>>
>> Mike
>>
>>
>> On 4/24/09 12:52 PM, Zhao Zhang wrote:
>>> Hi, All
>>>
>>> As I got mapped on gwynn, I redid the tests. The results are
>>>
>>> All language behaviour tests passed
>>> These sites failed: coaster/tgncsa-hg-coaster-pbs-gram2.xml 
>>> coaster/tgncsa-hg-coaster-pbs-gram4.xml
>>> These sites worked: coaster/coaster-local.xml 
>>> coaster/gwynn-coaster-gram2-gram2-condor.xml 
>>> coaster/gwynn-coaster-gram2-gram2-fork.xml 
>>> coaster/renci-engage-coaster.xml coaster/teraport-gt2-gt2-pbs.xml 
>>> coaster/uj-pbs-gram2.xml
>>>
>>> Logs could be found at 
>>> /home/zzhang/swift_coaster/cog/modules/swift/tests/sites/log_all on CI 
>>> network.
>>>
>>> zhao
>>>
>>> Zhao Zhang wrote:
>>>> Hi, again
>>>>
>>>> the test on teraport is successful, here is the log
>>>>
>>>> zhao
>>>>
>>>> testing site configuration: coaster/teraport-gt2-gt2-pbs.xml
>>>> Removing files from previous runs
>>>> Running test 061-cattwo at Thu Apr 23 11:27:19 CDT 2009
>>>> Swift 0.9rc2 swift-r2860 cog-r2388
>>>>
>>>> RunID: 20090423-1127-aluxx4m9
>>>> Progress:
>>>> Progress:  Stage in:1
>>>> Progress:  Submitted:1
>>>> Progress:  Submitted:1
>>>> ...
>>>> Progress:  Active:1
>>>> Progress:  Finished successfully:1
>>>> Final status:  Finished successfully:1
>>>> Cleaning up...
>>>> Shutting down service at https://128.135.125.118:57080
>>>> Got channel MetaChannel: 2129305 -> GSSSChannel-null(1)
>>>> - Done
>>>> expecting 061-cattwo.out.expected
>>>> checking 061-cattwo.out.expected
>>>> Skipping exception test due to test configuration
>>>> Test passed at Thu Apr 23 11:57:39 CDT 2009
>>>> ----------===========================----------
>>>> Running test 130-fmri at Thu Apr 23 11:57:39 CDT 2009
>>>> Swift 0.9rc2 swift-r2860 cog-r2388
>>>>
>>>> RunID: 20090423-1157-r8sarc77
>>>> Progress:
>>>> Progress:  Selecting site:2  Initializing site shared directory:1  
>>>> Stage in:1
>>>> Progress:  Selecting site:2  Stage in:1  Submitting:1
>>>> Progress:  Selecting site:2  Submitting:1  Submitted:1
>>>> ...
>>>> Progress:  Selecting site:2  Submitted:2
>>>> Progress:  Selecting site:2  Submitted:2
>>>> Progress:  Selecting site:2  Submitted:2
>>>> Progress:  Selecting site:2  Submitted:1  Active:1
>>>> Progress:  Selecting site:2  Active:1  Stage out:1
>>>> Progress:  Selecting site:1  Stage in:1  Stage out:1  Finished 
>>>> successfully:1
>>>> Progress:  Submitted:1  Stage out:1  Finished successfully:2
>>>> Progress:  Active:1  Finished successfully:4
>>>> Progress:  Submitting:2  Submitted:1  Finished successfully:5
>>>> Progress:  Active:2  Stage out:1  Finished successfully:5
>>>> Progress:  Submitted:1  Stage out:2  Finished successfully:8
>>>> Final status:  Finished successfully:11
>>>> Cleaning up...
>>>> Shutting down service at https://128.135.125.118:52773
>>>> Got channel MetaChannel: 28761475 -> GSSSChannel-null(1)
>>>> - Done
>>>> expecting 130-fmri.0000.jpeg.expected 130-fmri.0001.jpeg.expected 
>>>> 130-fmri.0002.jpeg.expected
>>>> checking 130-fmri.0000.jpeg.expected
>>>> Skipping exception test due to test configuration
>>>> checking 130-fmri.0001.jpeg.expected
>>>> Skipping exception test due to test configuration
>>>> checking 130-fmri.0002.jpeg.expected
>>>> Skipping exception test due to test configuration
>>>> Test passed at Thu Apr 23 12:04:47 CDT 2009
>>>> ----------===========================----------
>>>> Running test 103-quote at Thu Apr 23 12:04:47 CDT 2009
>>>> Swift 0.9rc2 swift-r2860 cog-r2388
>>>>
>>>> RunID: 20090423-1204-sjzpkfd3
>>>> Progress:
>>>> Progress:  Stage in:1
>>>> Progress:  Submitted:1
>>>> Progress:  Active:1
>>>> Progress:  Finished successfully:1
>>>> Final status:  Finished successfully:1
>>>> Cleaning up...
>>>> Shutting down service at https://128.135.125.118:40813
>>>> Got channel MetaChannel: 28500325 -> GSSSChannel-null(1)
>>>> - Done
>>>> expecting 103-quote.out.expected
>>>> checking 103-quote.out.expected
>>>> Skipping exception test due to test configuration
>>>> Test passed at Thu Apr 23 12:05:05 CDT 2009
>>>> ----------===========================----------
>>>> Running test 1032-singlequote at Thu Apr 23 12:05:05 CDT 2009
>>>> Swift 0.9rc2 swift-r2860 cog-r2388
>>>>
>>>> RunID: 20090423-1205-x2d55af3
>>>> Progress:
>>>> Progress:  Stage in:1
>>>> Progress:  Submitted:1
>>>> Progress:  Active:1
>>>> Progress:  Finished successfully:1
>>>> Final status:  Finished successfully:1
>>>> Cleaning up...
>>>> Shutting down service at https://128.135.125.118:44126
>>>> Got channel MetaChannel: 18100302 -> GSSSChannel-null(1)
>>>> - Done
>>>> expecting 1032-singlequote.out.expected
>>>> checking 1032-singlequote.out.expected
>>>> Skipping exception test due to test configuration
>>>> Test passed at Thu Apr 23 12:05:22 CDT 2009
>>>> ----------===========================----------
>>>> Running test 1031-quote at Thu Apr 23 12:05:22 CDT 2009
>>>> Swift 0.9rc2 swift-r2860 cog-r2388
>>>>
>>>> RunID: 20090423-1205-5aa1ko4e
>>>> Progress:
>>>> Progress:  Stage in:1
>>>> Progress:  Submitted:1
>>>> Progress:  Active:1
>>>> Final status:  Finished successfully:1
>>>> Cleaning up...
>>>> Shutting down service at https://128.135.125.118:43759
>>>> Got channel MetaChannel: 19002607 -> GSSSChannel-null(1)
>>>> - Done
>>>> expecting 1031-quote.*.expected
>>>> No expected output files specified for this test case - not checking 
>>>> output.
>>>> Skipping exception test due to test configuration
>>>> Test passed at Thu Apr 23 12:05:38 CDT 2009
>>>> ----------===========================----------
>>>> Running test 1033-singlequote at Thu Apr 23 12:05:38 CDT 2009
>>>> Swift 0.9rc2 swift-r2860 cog-r2388
>>>>
>>>> RunID: 20090423-1205-8nopyujc
>>>> Progress:
>>>> Progress:  Stage in:1
>>>> Progress:  Submitted:1
>>>> Progress:  Active:1
>>>> Progress:  Finished successfully:1
>>>> Final status:  Finished successfully:1
>>>> Cleaning up...
>>>> Shutting down service at https://128.135.125.118:39924
>>>> Got channel MetaChannel: 31196317 -> GSSSChannel-null(1)
>>>> - Done
>>>> expecting 1033-singlequote.out.expected
>>>> checking 1033-singlequote.out.expected
>>>> Skipping exception test due to test configuration
>>>> Test passed at Thu Apr 23 12:05:56 CDT 2009
>>>> ----------===========================----------
>>>> Running test 141-space-in-filename at Thu Apr 23 12:05:56 CDT 2009
>>>> Swift 0.9rc2 swift-r2860 cog-r2388
>>>>
>>>> RunID: 20090423-1205-aalqz1c4
>>>> Progress:
>>>> Progress:  Stage in:1
>>>> Progress:  Submitted:1
>>>> Progress:  Active:1
>>>> Progress:  Finished successfully:1
>>>> Final status:  Finished successfully:1
>>>> Cleaning up...
>>>> Shutting down service at https://128.135.125.118:60177
>>>> Got channel MetaChannel: 4728458 -> GSSSChannel-null(1)
>>>> - Done
>>>> expecting 141-space-in-filename.space here.out.expected
>>>> checking 141-space-in-filename.space here.out.expected
>>>> Skipping exception test due to test configuration
>>>> Test passed at Thu Apr 23 12:06:15 CDT 2009
>>>> ----------===========================----------
>>>> Running test 142-space-and-quotes at Thu Apr 23 12:06:15 CDT 2009
>>>> Swift 0.9rc2 swift-r2860 cog-r2388
>>>>
>>>> RunID: 20090423-1206-8617gag1
>>>> Progress:
>>>> Progress:  Selecting site:2  Initializing site shared directory:1  
>>>> Stage in:1
>>>> Progress:  Selecting site:2  Submitting:1  Submitted:1
>>>> Progress:  Selecting site:2  Submitted:1  Active:1
>>>> Progress:  Selecting site:2  Active:1  Finished successfully:1
>>>> Progress:  Stage out:1  Finished successfully:3
>>>> Final status:  Finished successfully:4
>>>> Cleaning up...
>>>> Shutting down service at https://128.135.125.118:57945
>>>> Got channel MetaChannel: 16387060 -> GSSSChannel-null(1)
>>>> - Done
>>>> expecting 142-space-and-quotes.2" space ".out.expected 
>>>> 142-space-and-quotes.3' space '.out.expected 
>>>> 142-space-and-quotes.out.expected 142-space-and-quotes. space 
>>>> .out.expected
>>>> checking 142-space-and-quotes.2" space ".out.expected
>>>> Skipping exception test due to test configuration
>>>> checking 142-space-and-quotes.3' space '.out.expected
>>>> Skipping exception test due to test configuration
>>>> checking 142-space-and-quotes.out.expected
>>>> Skipping exception test due to test configuration
>>>> checking 142-space-and-quotes. space .out.expected
>>>> Skipping exception test due to test configuration
>>>> Test passed at Thu Apr 23 12:06:35 CDT 2009
>>>> ----------===========================----------
>>>> All language behaviour tests passed
>>>>
>>>>
>>>>
>>>> Zhao Zhang wrote:
>>>>> Hi, Ben
>>>>>
>>>>> Ben Clifford wrote:
>>>>>> On Thu, 23 Apr 2009, Zhao Zhang wrote:
>>>>>>
>>>>>>  
>>>>>>> Error 1: This is related to CI network setting,
>>>>>>> /etc/grid-security/hostcert.pem. Could anyone help on this? Who 
>>>>>>> should I
>>>>>>> contact?
>>>>>>>     
>>>>>> fletch is broken. But try changing those sites files to use 
>>>>>> gwynn.bsd.uchicago.edu instead.
>>>>>>
>>>>>>  
>>>>>>> Error 2: My certificate is not enabled on teraport, As Mike and I 
>>>>>>> talked last
>>>>>>> night, "certificate revocation list" on CI network is misconfigured.
>>>>>>>     
>>>>>> This looks more like a permissions problem - the directory being 
>>>>>> used in the sites.xml file for that test does not exist and you do 
>>>>>> not have permission to create it.
>>>>>>
>>>>>> In r2874 I have changed tests/sites/coaster/teraport-gt2-gt2-pbs.xml 
>>>>>> to use a different path that should work for you now.
>>>>>>   
>>>>> I tried this out, it failed, then I increased the wall-time to 15 
>>>>> minutes in the coaster/teraport-gt2-gt2-pbs.xml  file.
>>>>> And I am waiting now.
>>>>>
>>>>> zhao
>>>>>
>>>>> [zzhang at communicado sites]$ ./run-site coaster/teraport-gt2-gt2-pbs.xml
>>>>> testing site configuration: coaster/teraport-gt2-gt2-pbs.xml
>>>>> Removing files from previous runs
>>>>> Running test 061-cattwo at Thu Apr 23 11:12:09 CDT 2009
>>>>> Swift 0.9rc2 swift-r2860 cog-r2388
>>>>>
>>>>> RunID: 20090423-1112-6jqlxfcf
>>>>> Progress:
>>>>> Progress:  Stage in:1
>>>>> Progress:  Submitted:1
>>>>> Failed to transfer wrapper log from 
>>>>> 061-cattwo-20090423-1112-6jqlxfcf/info/q on teraport
>>>>> Failed to transfer wrapper log from 
>>>>> 061-cattwo-20090423-1112-6jqlxfcf/info/s on teraport
>>>>> Progress:  Stage in:1
>>>>> Failed to transfer wrapper log from 
>>>>> 061-cattwo-20090423-1112-6jqlxfcf/info/u on teraport
>>>>> Progress:  Failed:1
>>>>> Execution failed:
>>>>>        Exception in cat:
>>>>> Arguments: [061-cattwo.1.in, 061-cattwo.2.in]
>>>>> Host: teraport
>>>>> Directory: 061-cattwo-20090423-1112-6jqlxfcf/jobs/u/cat-umlmrs9j
>>>>> stderr.txt:
>>>>>
>>>>> stdout.txt:
>>>>>
>>>>> ----
>>>>>
>>>>> Caused by:
>>>>>        Job cannot be run with the given max walltime worker 
>>>>> constraint (task: 600, maxwalltime: 300s)
>>>>> Cleaning up...
>>>>> Shutting down service at https://128.135.125.118:58204
>>>>> Got channel MetaChannel: 1297642 -> GSSSChannel-null(1)
>>>>> - Done
>>>>> SWIFT RETURN CODE NON-ZERO - test 061-cattwo
>>>>>
>>>>>>  
>>>>>>> Error 3 & Error 4: I am not active on tgncsa site. Mike said he 
>>>>>>> needed to add
>>>>>>> me to another group.
>>>>>>>     
>>>>>> yes.
>>>>>>
>>>>>> Do you have the list from the end of your test run about which sites 
>>>>>> worked and which did not?
>>>>>>
>>>>>>   
>>>>> _______________________________________________
>>>>> Swift-devel mailing list
>>>>> Swift-devel at ci.uchicago.edu
>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>
> 


