[Swift-devel] bug 53
Ioan Raicu
iraicu at cs.uchicago.edu
Tue Sep 18 13:39:42 CDT 2007
Actually, TG machines such as the login nodes are OK (dual processors,
4GB of memory, etc.), but when we were running things on the Purdue
login nodes, they were not stable: the nodes were being rebooted daily
and we were losing the experiments. Also, login nodes are typically
not meant to host long-term services that consume large amounts of
system resources, so if we were to run a large workflow for days and
the workflow consumed large amounts of memory and processing power
on that node, the site admins might start complaining. It would be nice
to have a dedicated production machine (with plenty of memory and
processors) just for workflow submission. BTW, some workflows also have
large datasets, so making sure this machine has access to large amounts
of disk is also important (something viper never had). Maybe there is
already one around and we just need to find it.
Ioan
Veronika Nefedova wrote:
> Viper was used only because *no* other machine could handle the size
> of moldyn workflow (not at ANL and not at ci). Now, since the code has
> been reduced dramatically, it's quite possible to use terminable -- and
> I've switched to running the tests from terminable this morning.
>
> NIka
>
> On Sep 18, 2007, at 1:13 PM, Ian Foster wrote:
>
>> It seems ridiculous to me that we are still using a student-supported
>> machine to run major applications. Surely we should have one highly
>> capable, well-maintained machine for this? And this shouldn't be a
>> "suggestion" but a clear policy.
>>
>>
>> -----Original Message-----
>> From: Ioan Raicu <iraicu at cs.uchicago.edu>
>>
>> Date: Tue, 18 Sep 2007 13:13:08
>> To:Michael Wilde <wilde at mcs.anl.gov>
>> Cc:swift-devel at ci.uchicago.edu
>> Subject: Re: [Swift-devel] bug 53
>>
>>
>>
>>
>> Michael Wilde wrote:
>>> It's not clear when this happened, as Nika and Ioan's workflow
>>> submission from viper has afaik been mostly through Falkon for quite a
>>> while now.
>>>
>>> Nika, perhaps you can shift back to trying the two Falkon approaches
>>> (with higher prio on testing Ioan's retry code) in the meantime.
>>>
>>> Ioan, is CI Support / Ti supporting viper, or are you the "sysadmin"
>>> Ben is referring to?
>>>
>> Yes, I am viper's support. viper is my department office machine.
>>> I've also suggested in the past that we focus on using evitable and
>>> terminable (and swift03/04) as our main submit hosts, primarily for
>>> support and coordination reasons. Is this a good time to try the
>>> GRAM/non-Falkon workflow there?
>> Sure, but watch out for the large MolDyn runs as 1GB or less of memory
>> is not enough for 244 mol runs.
>>
>> Ioan
>>>
>>> - Mike
>>>
>>>
>>> Ben Clifford wrote:
>>>> sounds like viper had firewall configuration changed recently. viper
>>>> sysadmin needs to help debug basic job submission with simple globus
>>>> tools before that machine is worth using again.
>>>>
>>>> On Tue, 18 Sep 2007, Michael Wilde wrote:
>>>>
>>>>> does the cog equivalent of globus_tcp_source_range also need to be set?
>>>>> is that only for gridftp, or gram as well? or could this be a gridftp
>>>>> hang?
>>>>>
>>>>> - mike
>>>>>
>>>>> Ben Clifford wrote:
>>>>>> can you submit a job using globus-job-run?
>>>>>>
>>>>>> On Tue, 18 Sep 2007, Veronika Nefedova wrote:
>>>>>>
>>>>>>> I set tcp.port.range in swift properties but even a simple helloworld
>>>>>>> workflow hangs (the submit host doesn't receive the notification from
>>>>>>> the compute host that the job has finished).
>>>>>>> tcp.port.range=50000,60000
>>>>>>>
>>>>>>> Not sure what else has changed on viper? It used to be a very good
>>>>>>> submit host, I never had any problems with it );
>>>>>>>
>>>>>>> Nika
>>>>>>>
>>>>>>> On Sep 18, 2007, at 9:13 AM, Mihael Hategan wrote:
>>>>>>>
>>>>>>>> Should pick that one. If not ~/.globus/cog.properties ->
>>>>>>>> tcp.port.range=begin,end
>>>>>>>>
>>>>>>>> On Tue, 2007-09-18 at 07:42 +0000, Ben Clifford wrote:
>>>>>>>>> Not sure if cog picks up the GLOBUS_whatever environment variables.
>>>>>>>>> Mihael presumably knows.
>>>>>>>>>
>>>>>>>>> On Mon, 17 Sep 2007, Ioan Raicu wrote:
>>>>>>>>>
>>>>>>>>>> There is a firewall on viper. Ports 50000 - 60000 are open for TCP.
>>>>>>>>>> You might want to set the TCP_PORT_RANGE (I am not sure this is the
>>>>>>>>>> exact environment variable, but something like that) to be between
>>>>>>>>>> 50K and 60K ports to ensure that GT4 uses one of these open ports.
>>>>>>>>>> Ioan
>>>>>>>>>>
>>>>>>>>>> Veronika Nefedova wrote:
>>>>>>>>>>> The same. You can check the job's status in its log on viper in
>>>>>>>>>>> ~nefedova/alamines/MolDyn-244-loops-20070917-1356-h95gxij8.log.
>>>>>>>>>>>
>>>>>>>>>>> The job is still running (i.e. hanging) with the same symptom as
>>>>>>>>>>> before: the first job is done and then nothing else gets submitted
>>>>>>>>>>> (the submit host doesn't receive any notification that the job has
>>>>>>>>>>> finished).
>>>>>>>>>>>
>>>>>>>>>>> NIka
>>>>>>>>>>>
>>>>>>>>>>> On Sep 17, 2007, at 9:51 AM, Mihael Hategan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Mon, 2007-09-17 at 09:41 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>>> I did 'svn up' in cog directory and then did 'ant dist' in the
>>>>>>>>>>>>> same directory.
>>>>>>>>>>>> 'ant dist' should be done in the swift directory.
>>>>>>>>>>>>
>>>>>>>>>>>>> My 'svn info' gives me r1740.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 17, 2007, at 8:55 AM, Mihael Hategan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Did you update cog?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, 2007-09-17 at 08:38 -0500, Veronika Nefedova wrote:
>>>>>>>>>>>>>>> No, I've tried with r1740, it still hung (timed out).
>>>>>>>>>>>>>>> The log is on
>>>>>>>>>>>>>>> viper:/home/nefedova/alamines/MolDyn-244-loops-20070914-1834-pvhyji75.log
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> NIka
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sep 15, 2007, at 10:59 AM, Mihael Hategan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, 2007-09-15 at 09:06 +0000, Ben Clifford wrote:
>>>>>>>>>>>>>>>>> On Fri, 14 Sep 2007, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, 2007-09-13 at 16:41 -0500, Mihael Hategan wrote:
>>>>>>>>>>>>>>>>>>> Ok, so there's something in.
>>>>>>>>>>>>>>>>>> That something was throttling a bit too much (not just jobs,
>>>>>>>>>>>>>>>>>> but all tasks on that site). I need to take a second look at it.
>>>>>>>>>>>>>>>>> Is that fixed by cog r1740? It looks like that commit is
>>>>>>>>>>>>>>>>> intended to.
>>>>>>>>>>>>>>>> It's an attempt to fix it, but it needs to be confirmed by Nika.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> Swift-devel mailing list
>>>>>>>>>>>>>>>> Swift-devel at ci.uchicago.edu
>>>>>>>>>>>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>>>>>>>>>>>
>>
>>
>
>
--
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web: http://www.cs.uchicago.edu/~iraicu
http://dsl.cs.uchicago.edu/
============================================
============================================