[Swift-devel] Q about MolDyn
Ioan Raicu
iraicu at cs.uchicago.edu
Mon Aug 6 22:28:47 CDT 2007
It looks like viper (where Swift is running) is idle, and so is
tg-viz-login2 (where Falkon is running).
Here is what the normal sequence of events looks like for a
successful task:
iraicu at viper:/home/nefedova/alamines> grep "urn:0-1-73-2-31-0-0-1186444341989" MolDyn-244-loops-zhgo6be8tjhi1.log
2007-08-06 19:08:25,121 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Submitted
2007-08-06 20:58:17,685 DEBUG NotificationThread notification: urn:0-1-73-2-31-0-0-1186444341989 0
2007-08-06 20:58:17,723 DEBUG TaskImpl Task(type=1, identity=urn:0-1-73-2-31-0-0-1186444341989) setting status to Completed
iraicu at viper:/home/nefedova/alamines> grep "setting status to Submitted" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
17566 175660 2179412
iraicu at viper:/home/nefedova/alamines> grep "NotificationThread notification" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
7959 55713 785035
iraicu at viper:/home/nefedova/alamines> grep "setting status to Completed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
190968 1909680 24003796
Now, 17566 tasks were submitted, 7959 notifications were received from
Falkon, and 190968 tasks were set to Completed...
Obviously this isn't right. Falkon only saw 7959 tasks, so I would
argue that the # of notifications received is correct. The submitted #
of tasks looks like the # I would have expected, but all the tasks did
not make it to Falkon. The Falkon provider is what sits between the
change of status to submitted, and the receipt of the notification, so I
would say that is the first place we need to look for more details...
There used to be some extra debug info in the Falkon provider that simply
printed all the tasks that were actually being submitted to Falkon (as
opposed to just the change of status within Karajan). I don't see those
debug statements; I bet they got overwritten in the SVN update.
What about the completed tasks, why are there so many (190K) completed
tasks? Where did they come from?
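
One way to dig into both questions is to tally the log per task rather than
in aggregate. The following is only a rough sketch: the regular expressions
are inferred from the three sample log lines above and assume each log entry
sits on a single line, and the function names tally_events and report are
made up for illustration. It breaks the status changes down by task type and
status, and checks which submitted task ids never got a notification.

```python
import re
from collections import Counter, defaultdict

# Assumed formats, inferred from the sample log lines in this message.
STATUS = re.compile(
    r"Task\(type=(\d+),\s*identity=(urn:[\w-]+)\) setting status to (\w+)"
)
NOTIFIED = re.compile(r"NotificationThread notification:\s*(urn:[\w-]+)")


def tally_events(lines):
    """Return (count per (type, status), set of task ids per event name)."""
    by_type_status = Counter()
    ids = defaultdict(set)
    for line in lines:
        m = STATUS.search(line)
        if m:
            task_type, task_id, status = m.groups()
            by_type_status[(task_type, status)] += 1
            ids[status].add(task_id)
            continue
        m = NOTIFIED.search(line)
        if m:
            ids["Notified"].add(m.group(1))
    return by_type_status, ids


def report(path):
    """Print a per-type breakdown and the gaps between event sets."""
    with open(path, errors="replace") as fh:
        by_type_status, ids = tally_events(fh)
    for (task_type, status), n in sorted(by_type_status.items()):
        print(f"type={task_type} {status}: {n}")
    missing = ids["Submitted"] - ids["Notified"]
    print(f"submitted but never notified: {len(missing)}")
    extra = ids["Completed"] - ids["Submitted"]
    print(f"completed but never submitted: {len(extra)}")
```

Calling report("MolDyn-244-loops-zhgo6be8tjhi1.log") on viper would show
whether the Completed lines are spread across several task types, and how
many submitted ids never show up in a notification.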
Yong, are you keeping up with these emails? Do you still have a copy of
the latest Falkon provider that you edited just before you left? Can
you just take a look through there to make sure nothing has been broken
with the SVN updates? If you don't have time for this now (considering
today was your first day on the new job), I'll dig through there and see
if I can make some sense of what is happening!
One last thing, Ben mentioned that the Falkon provider you saw in Nika's
account was different than what was in SVN. Ben, did you at least look
at modification dates? How old was one as opposed to the other? I hope
we did not revert back to an older version that might have had some bug
in it....
Ioan
Veronika Nefedova wrote:
> Well, there are some discrepancies:
>
> nefedova at viper:~/alamines> grep "Completed job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> 7959 244749 3241072
> nefedova at viper:~/alamines> grep "Running job" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> 17207 564648 7949388
> nefedova at viper:~/alamines>
>
> I.e. more than half of the jobs haven't finished (according to Swift)
>
> I also have some exceptions:
>
> 2007-08-06 19:08:49,378 DEBUG TaskImpl Task(type=2, identity=urn:0-1-101-2-37-0-0-1186444363341) setting status to Failed
> Exception in getFile
> (80 of those):
> nefedova at viper:~/alamines> grep "ailed" MolDyn-244-loops-zhgo6be8tjhi1.log | wc
> 80 880 9705
> nefedova at viper:~/alamines>
>
>
> Nika
>
> On Aug 6, 2007, at 9:36 PM, Veronika Nefedova wrote:
>
>> What's up now? Everything has stopped, no errors on swift site...
>> Do you have any errors now?
>>
>> Nika
>>
>> On Aug 6, 2007, at 6:04 PM, Ioan Raicu wrote:
>>
>>> OK, I restarted Falkon as well as there were 12K jobs trying to go
>>> through, and keeping the entire ANL/UC site busy, although there was
>>> no Swift on the other end to pick up the notifications...
>>>
>>> here is the new info:
>>>
>>> Falkon Factory Service:
>>> http://tg-viz-login2:50020/wsrf/services/GenericPortal/core/WS/GPFactoryService
>>>
>>> Web server: http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>
>>> Note that I changed the port #, it's now 50020, so don't forget to
>>> change that before you start Swift...
>>>
>>> Ioan
>>>
>
>