[Swift-devel] Q about MolDyn

Ioan Raicu iraicu at cs.uchicago.edu
Mon Aug 6 22:28:47 CDT 2007


It looks like viper (where Swift is running) is idle, and so is 
tg-viz-login2 (where Falkon is running). 

What seems evident to me is the normal sequence of events for a 
successful task:
iraicu at viper:/home/nefedova/alamines> grep 
"urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, 
identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
2007-08-06 20:58:17,685 DEBUG NotificationThread notification: 
urn:0-1-73-2-31-0-0-1186444341989 0
2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, 
identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed

iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" 
MolDyn-244-loops-zhgo6be8tjhi1.log | wc
  17566  175660 2179412

iraicu at viper:/home/nefedova/alamines> grep "NotificationThread 
notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
   7959   55713  785035

iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" 
MolDyn-244-loops-zhgo6be8tjhi1.log | wc
 190968 1909680 24003796

Now, 17566 tasks were submitted, 7959 notifications were received from 
Falkon, and 190968 tasks were set to completed...

Obviously this isn't right.  Falkon only saw 7959 tasks, so I would 
argue that the # of notifications received is correct.  The # of 
submitted tasks looks like what I would have expected, but not all of 
the tasks made it to Falkon.  The Falkon provider is what sits between 
the change of status to Submitted and the receipt of the notification, 
so I would say that is the first place we need to look for more 
details.  There used to be some extra debug info in the Falkon provider 
that simply printed all the tasks that were actually being submitted to 
Falkon (as opposed to just the change of status within Karajan).  I 
don't see those debug statements; I bet they got overwritten in the SVN 
update. 
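
One way to narrow this down would be to list the task IDs that were set 
to Submitted but never show up in a NotificationThread line.  Below is a 
minimal sketch, assuming each log record sits on a single line in the 
format shown above (the wrapping in this message comes from the mail 
client) and assuming type=1 is the task type that goes through the 
Falkon provider, as in the successful example.  The function name and 
the "urn:example-..." record are made up for illustration.

```python
import re

# Assumed record formats, copied from the log excerpts (one record per line).
SUBMITTED = re.compile(
    r"Task\(type=(\d+), identity=(urn:[\w-]+)\) setting status to Submitted"
)
NOTIFIED = re.compile(r"NotificationThread notification: (urn:[\w-]+)")


def submitted_without_notification(lines, task_type="1"):
    """Return sorted IDs set to Submitted that never got a notification.

    task_type="1" is an assumption: it is the type seen on the task that
    completed successfully through Falkon.
    """
    submitted, notified = set(), set()
    for line in lines:
        m = SUBMITTED.search(line)
        if m:
            if m.group(1) == task_type:
                submitted.add(m.group(2))
            continue
        m = NOTIFIED.search(line)
        if m:
            notified.add(m.group(1))
    return sorted(submitted - notified)


sample = [
    "2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, "
    "identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted",
    "2007-08-06 20:58:17,685 DEBUG NotificationThread notification: "
    "urn:0-1-73-2-31-0-0-1186444341989 0",
    # Made-up record for a task that never receives a notification.
    "2007-08-06 19:08:26,000 DEBUG TaskImpl Task(type=1, "
    "identity=urn:example-never-notified) setting status to Submitted",
]
print(submitted_without_notification(sample))
```

On the real log, the same function could be fed the open file object 
for MolDyn-244-loops-zhgo6be8tjhi1.log instead of the sample list.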

What about the completed tasks?  Why are there so many (190K) of them?  
Where did they come from?
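
To see where they come from, it might help to break the "setting status 
to Completed" lines down by task type, and to check whether the same 
task identity is being marked Completed more than once.  A rough sketch, 
assuming one log record per line in the TaskImpl format shown above; the 
function name and the "urn:example-..." records are hypothetical.

```python
import re
from collections import Counter

# Assumed record format, copied from the TaskImpl excerpts (one per line).
COMPLETED = re.compile(
    r"Task\(type=(\d+), identity=(urn:[\w-]+)\) setting status to Completed"
)


def completed_breakdown(lines):
    """Count Completed records per task type and find repeated identities.

    Returns (per_type, repeated): per_type maps the type string to its
    number of Completed records; repeated maps each identity that was
    marked Completed more than once to its count.
    """
    per_type = Counter()
    per_task = Counter()
    for line in lines:
        m = COMPLETED.search(line)
        if m:
            per_type[m.group(1)] += 1
            per_task[m.group(2)] += 1
    repeated = {tid: n for tid, n in per_task.items() if n > 1}
    return per_type, repeated


sample = [
    "2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, "
    "identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed",
    # Made-up records: one identity marked Completed twice.
    "2007-08-06 20:58:18,000 DEBUG TaskImpl Task(type=2, "
    "identity=urn:example-dup) setting status to Completed",
    "2007-08-06 20:58:19,000 DEBUG TaskImpl Task(type=2, "
    "identity=urn:example-dup) setting status to Completed",
]
per_type, repeated = completed_breakdown(sample)
print(dict(per_type), repeated)
```

If most of the 190K turn out to be a different type than the Falkon 
jobs, or the same identities repeated, that would explain why the 
Completed count is so much larger than the Submitted count.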

Yong, are you keeping up with these emails?  Do you still have a copy of 
the latest Falkon provider that you edited just before you left?  Can 
you just take a look through there to make sure nothing has been broken 
with the SVN updates?  If you don't have time for this now (considering 
today was your first day on the new job), I'll dig through there and see 
if I can make some sense of what is happening!

One last thing, Ben mentioned that the Falkon provider you saw in Nika's 
account was different than what was in SVN.  Ben, did you at least look 
at modification dates?  How old was one as opposed to the other?  I hope 
we did not revert to an older version that might have had some bug 
in it....

Ioan

Veronika Nefedova wrote:
> Well, there are some discrepancies:
>
> nefedova at viper:~/alamines> grep "Completed job" 
> MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>    7959  244749 3241072
> nefedova at viper:~/alamines> grep "Running job" 
> MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>   17207  564648 7949388
> nefedova at viper:~/alamines>
>
> I.e. almost half of the jobs haven't finished (according to swift)
>
> I also have some exceptions:
>
> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, 
> identity=urn:0-1-101-2-37-0-0-1186444363341) setting status to Failed 
> Exception in getFile
> (80 of those):
> nefedova at viper:~/alamines> grep "ailed" 
> MolDyn-244-loops-zhgo6be8tjhi1.log | wc
>      80     880    9705
> nefedova at viper:~/alamines>
>
>
> Nika
>
> On Aug 6, 2007, at 9:36 PM, Veronika Nefedova wrote:
>
>> What's up now? Everything has stopped, no errors on swift site...
>> Do you have any errors now?
>>
>> Nika
>>
>> On Aug 6, 2007, at 6:04 PM, Ioan Raicu wrote:
>>
>>> OK, I restarted Falkon as well as there were 12K jobs trying to go 
>>> through, and keeping the entire ANL/UC site busy, although there was 
>>> no Swift on the other end to pick up the notifications...
>>>
>>> here is the new info:
>>>
>>> Falkon Factory Service: 
>>> http://tg-viz-login2:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService 
>>>
>>> Web server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>
>>> Note that I changed the port #, it's now 50020, so don't forget to 
>>> change that before you start Swift...
>>>
>>> Ioan
>>>
>
>


