[Swift-devel] bug 53
Ioan Raicu
iraicu at cs.uchicago.edu
Tue Sep 18 13:27:34 CDT 2007
IMO, the biggest hurdle for large workflows will be memory (I recommend
2GB+). If the jobs are short enough that we end up pushing hundreds of
jobs/sec to Falkon for prolonged periods of time, having multiple
processors might also be important.
Ioan
Ian Foster wrote:
> It seems ridiculous to me that we are still using a student-supported machine to run major applications. Surely we should have one highly capable, well-maintained machine for this? And this shouldn't be a "suggestion" but a clear policy.
>
>
>
> -----Original Message-----
> From: Ioan Raicu <iraicu at cs.uchicago.edu>
>
> Date: Tue, 18 Sep 2007 13:13:08
> To: Michael Wilde <wilde at mcs.anl.gov>
> Cc: swift-devel at ci.uchicago.edu
> Subject: Re: [Swift-devel] bug 53
>
>
>
>
> Michael Wilde wrote:
>
>> It's not clear when this happened, as Nika and Ioan's workflow
>> submission from viper has afaik been mostly through Falkon for quite a
>> while now.
>>
>> Nika, perhaps you can shift back to trying the two Falkon approaches
>> (with higher prio on testing Ioan's retry code) in the meantime.
>>
>> Ioan, is CI Support / Ti supporting viper, or are you the "sysadmin"
>> Ben is referring to?
>>
>>
> Yes, I am viper's support. viper is my department office machine.
>
>> I've also suggested in the past that we focus on using evitable and
>> terminable (and swift03/04) as our main submit hosts, primarily for
>> support and coordination reasons. Is this a good time to try the
>> GRAM/non-Falkon workflow there?
>>
> Sure, but watch out for the large MolDyn runs as 1GB or less of memory
> is not enough for 244 mol runs.
>
> Ioan
>
>> - Mike
>>
>>
>> Ben Clifford wrote:
>>
>>> sounds like viper had firewall configuration changed recently. viper
>>> sysadmin needs to help debug basic job submission with simple globus
>>> tools before that machine is worth using again.
>>>
>>> On Tue, 18 Sep 2007, Michael Wilde wrote:
>>>
>>>
>>>> does the cog equivalent of globus_tcp_source_range also need to be set?
>>>> is that only for gridftp, or gram as well? or could this be a
>>>> gridftp hang?
>>>>
>>>> - mike
>>>>
>>>> Ben Clifford wrote:
>>>>
>>>>> can you submit a job using globus-job-run?
>>>>>
>>>>> On Tue, 18 Sep 2007, Veronika Nefedova wrote:
>>>>>
>>>>>
>>>>>> I set tcp.port.range in swift properties but even a simple helloworld
>>>>>> workflow hangs (the submit host doesn't receive the notification from
>>>>>> the compute host that the job has finished).
>>>>>> tcp.port.range=50000,60000
>>>>>>
>>>>>> Not sure what else has changed on viper? It used to be a very good
>>>>>> submit host, I never had any problems with it );
>>>>>>
>>>>>> Nika
>>>>>>
>>>>>> On Sep 18, 2007, at 9:13 AM, Mihael Hategan wrote:
>>>>>>
>>>>>>
>>>>>>> Should pick that one. If not ~/.globus/cog.properties ->
>>>>>>> tcp.port.range=begin,end
>>>>>>>
>>>>>>> On Tue, 2007-09-18 at 07:42 +0000, Ben Clifford wrote:
>>>>>>>
>>>>>>>> Not sure if cog picks up the GLOBUS_whatever environment variables.
>>>>>>>> Mihael presumably knows.
>>>>>>>>
>>>>>>>> On Mon, 17 Sep 2007, Ioan Raicu wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> There is a firewall on viper. Ports 50000 - 60000 are open for TCP.
>>>>>>>>> You might want to set the TCP_PORT_RANGE (I am not sure this is the
>>>>>>>>> exact environment variable, but something like that) to be between
>>>>>>>>> 50K and 60K ports to ensure that GT4 uses one of these open ports.
>>>>>>>>> Ioan
>>>>>>>>>
>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>
>>>>>>>>>> The same. You can check the job's status in its log on viper in
>>>>>>>>>> ~nefedova/alamines/MolDyn-244-loops-20070917-1356-h95gxij8.log.
>>>>>>>>>>
>>>>>>>>>> The job is still running (i.e. hanging) with the same symptom as
>>>>>>>>>> before: the first job is done and then nothing else gets submitted
>>>>>>>>>> (the submit host doesn't receive any notification that the job has
>>>>>>>>>> finished).
>>>>>>>>>>
>>>>>>>>>> Nika
>>>>>>>>>>
>>>>>>>>>> On Sep 17, 2007, at 9:51 AM, Mihael Hategan wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Mon, 2007-09-17 at 09:41 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I did 'svn up' in cog directory and then did 'ant dist' in the same
>>>>>>>>>>>> directory.
>>>>>>>>>>>>
>>>>>>>>>>> 'ant dist' should be done in the swift directory.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> My 'svn info' gives me r1740.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 17, 2007, at 8:55 AM, Mihael Hategan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Did you update cog?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 2007-09-17 at 08:38 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> No, I've tried with r1740, it still hung (timed out).
>>>>>>>>>>>>>> The log is on
>>>>>>>>>>>>>> viper:/home/nefedova/alamines/MolDyn-244-loops-20070914-1834-pvhyji75.log
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nika
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 15, 2007, at 10:59 AM, Mihael Hategan wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, 2007-09-15 at 09:06 +0000, Ben Clifford wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, 14 Sep 2007, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 2007-09-13 at 16:41 -0500, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Ok, so there's something in.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> That something was throttling a bit too much (not just jobs,
>>>>>>>>>>>>>>>>> but all tasks on that site). I need to take a second look at it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is that fixed by cog r1740? It looks like that commit is
>>>>>>>>>>>>>>>> intended to.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's an attempt to fix it, but it needs to be confirmed by Nika.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>
>
>
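
The quoted thread turns on whether a port inside the range given by
tcp.port.range=50000,60000 is actually usable on the submit host. As a
rough sketch only (this is not part of cog or Swift, and the helper names
parse_port_range and first_bindable_port are hypothetical), the Python
below parses a 'begin,end' value of that form and looks for a port in the
range that can be bound locally. It checks local availability only; it
cannot tell whether the firewall lets the compute host's notification
through.

```python
import socket


def parse_port_range(value):
    """Parse a 'begin,end' string such as '50000,60000' into two ints."""
    begin, end = (int(part.strip()) for part in value.split(","))
    if not (0 < begin <= end <= 65535):
        raise ValueError(f"invalid port range: {value!r}")
    return begin, end


def first_bindable_port(begin, end, host=""):
    """Return the first port in [begin, end] that can be bound, or None.

    Assumption: an empty host means "all local interfaces". A successful
    bind only shows the port is free on this machine, nothing more.
    """
    for port in range(begin, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port in use or not permitted, try the next one
            return port
    return None


if __name__ == "__main__":
    begin, end = parse_port_range("50000,60000")
    print("first free port in range:", first_bindable_port(begin, end))
```

If this reports no free port, the range itself is the problem; if it
does find one, the next thing to test is plain job submission with the
Globus tools, as suggested in the thread.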
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================