[Swift-devel] Q about MolDyn

Ioan Raicu iraicu at cs.uchicago.edu
Mon Aug 6 15:27:59 CDT 2007


Everything is idle; there is no work to be done...

iraicu@tg-viz-login2:~/java/Falkon_v0.8.1/service/logs> tail GenericPortalWS_perf_per_sec.txt
3510.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
3511.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
3512.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
3513.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
3514.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
3515.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
3516.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
3517.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
3518.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
3519.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0

24 workers are registered but idle; the queue length is 0 and 57 jobs have completed.

Also, see all 57 jobs below: they all finished with an exit code of 0,
in other words successfully! How many jobs does Swift think it sent?

Ioan

iraicu@tg-viz-login2:~/java/Falkon_v0.8.1/service/logs> cat GenericPortalWS_taskPerf.txt
//taskNum taskID workerID startTimeStamp execTimeStamp resultsQueueTimeStamp endTimeStamp waitQueueTime execTime resultsQueueTime totalTime exitCode
1 urn:0-0-1186428880921 192.5.198.70:50100 510496 560276 560614 560629 49780 338 15 50133 0
2 urn:0-1-1-0-1186428880939 192.5.198.70:50101 560984 561200 561899 561909 216 699 10 925 0
3 urn:0-1-2-0-1186428880941 192.5.198.70:50100 560991 561373 562150 562159 382 777 9 1168 0
4 urn:0-0-1186429254652 192.5.198.71:50100 972312 1034716 1044916 1044926 62404 10200 10 72614 0
5 urn:0-1-2-0-1186429255467 192.5.198.71:50101 1046318 1046453 1047038 1047067 135 585 29 749 0
6 urn:0-1-1-0-1186429255461 192.5.198.71:50100 1046315 1046429 1053072 1053080 114 6643 8 6765 0
7 urn:0-1-3-0-1186429255469 192.5.198.71:50101 1046320 1047051 1054256 1054290 731 7205 34 7970 0
8 urn:0-1-5-0-1186429255481 192.5.198.71:50101 1046324 1054267 1054570 1054579 7943 303 9 8255 0
9 urn:0-1-4-0-1186429255479 192.5.198.71:50100 1046322 1053087 1056811 1056819 6765 3724 8 10497 0
10 urn:0-1-6-0-1186429255484 192.5.198.71:50101 1046326 1054583 1058691 1058719 8257 4108 28 12393 0
11 urn:0-1-8-0-1186429255495 192.5.198.71:50101 1046331 1058704 1059363 1059385 12373 659 22 13054 0
12 urn:0-1-7-0-1186429255486 192.5.198.71:50100 1046329 1056826 1060315 1060323 10497 3489 8 13994 0
13 urn:0-1-9-0-1186429255502 192.5.198.71:50101 1046333 1059375 1060589 1060596 13042 1214 7 14263 0
14 urn:0-1-11-0-1186429255514 192.5.198.71:50101 1046338 1060603 1060954 1061054 14265 351 100 14716 0
15 urn:0-1-10-0-1186429255511 192.5.198.71:50100 1046336 1060329 1061094 1061126 13993 765 32 14790 0
16 urn:0-1-14-0-1186429255533 192.5.198.71:50100 1046691 1061105 1065608 1065617 14414 4503 9 18926 0
17 urn:0-1-13-0-1186429255535 192.5.198.71:50100 1046693 1065622 1066307 1066315 18929 685 8 19622 0
18 urn:0-1-12-0-1186429255524 192.5.198.71:50101 1046689 1061045 1067540 1067563 14356 6495 23 20874 0
19 urn:0-1-15-0-1186429255539 192.5.198.71:50100 1046695 1066320 1069262 1069271 19625 2942 9 22576 0
20 urn:0-1-16-0-1186429255543 192.5.198.71:50101 1046697 1067551 1071003 1071011 20854 3452 8 24314 0
21 urn:0-1-18-0-1186429255559 192.5.198.71:50101 1046700 1071016 1071664 1071671 24316 648 7 24971 0
22 urn:0-1-17-0-1186429255557 192.5.198.71:50100 1046698 1069275 1071679 1071692 22577 2404 13 24994 0
23 urn:0-1-19-0-1186429255565 192.5.198.71:50101 1046702 1071687 1073978 1073988 24985 2291 10 27286 0
24 urn:0-1-20-0-1186429255572 192.5.198.71:50101 1046706 1073992 1075959 1075969 27286 1967 10 29263 0
25 urn:0-1-21-0-1186429255567 192.5.198.71:50100 1046704 1071699 1076704 1076713 24995 5005 9 30009 0
26 urn:0-1-22-0-1186429255587 192.5.198.71:50101 1046708 1075972 1077451 1077459 29264 1479 8 30751 0
27 urn:0-1-23-0-1186429255595 192.5.198.71:50100 1046710 1076717 1080157 1080165 30007 3440 8 33455 0
28 urn:0-1-25-0-1186429255599 192.5.198.71:50101 1046712 1077464 1080270 1080286 30752 2806 16 33574 0
29 urn:0-1-24-0-1186429255601 192.5.198.71:50100 1046713 1080170 1080611 1080619 33457 441 8 33906 0
30 urn:0-1-26-0-1186429255613 192.5.198.71:50100 1046717 1080624 1080973 1080983 33907 349 10 34266 0
31 urn:0-1-28-0-1186429255611 192.5.198.71:50101 1046715 1080281 1081405 1081413 33566 1124 8 34698 0
32 urn:0-1-27-0-1186429255616 192.5.198.71:50100 1046719 1080986 1082989 1082996 34267 2003 7 36277 0
33 urn:0-1-30-0-1186429255635 192.5.198.71:50100 1046723 1083002 1083370 1083378 36279 368 8 36655 0
34 urn:0-1-29-0-1186429255622 192.5.198.71:50101 1046721 1081417 1084830 1084837 34696 3413 7 38116 0
35 urn:0-1-32-0-1186429255652 192.5.198.71:50101 1047082 1084843 1085854 1085879 37761 1011 25 38797 0
36 urn:0-1-34-0-1186429255654 192.5.198.71:50101 1047085 1085865 1089502 1089511 38780 3637 9 42426 0
37 urn:0-1-33-0-1186429255656 192.5.198.71:50101 1047087 1089515 1089966 1089974 42428 451 8 42887 0
38 urn:0-1-31-0-1186429255642 192.5.198.71:50100 1046725 1083383 1091316 1091324 36658 7933 8 44599 0
39 urn:0-1-36-0-1186429255664 192.5.198.71:50100 1047092 1091329 1092042 1092049 44237 713 7 44957 0
40 urn:0-1-38-0-1186429255673 192.5.198.71:50100 1047095 1092055 1094242 1094249 44960 2187 7 47154 0
41 urn:0-1-35-0-1186429255658 192.5.198.71:50101 1047090 1089979 1094418 1094428 42889 4439 10 47338 0
42 urn:0-1-40-0-1186429255696 192.5.198.71:50101 1047102 1094433 1095082 1095089 47331 649 7 47987 0
43 urn:0-1-41-0-1186429255692 192.5.198.71:50101 1047104 1095095 1096846 1096853 47991 1751 7 49749 0
44 urn:0-1-39-0-1186429255686 192.5.198.71:50100 1047100 1094256 1098214 1098221 47156 3958 7 51121 0
45 urn:0-1-42-0-1186429255700 192.5.198.71:50101 1047107 1096859 1098627 1098637 49752 1768 10 51530 0
46 urn:0-1-37-0-1186429255681 192.5.198.67:50100 1047097 1094037 1098903 1098910 46940 4866 7 51813 0
47 urn:0-1-50-0-1186429255749 192.5.198.67:50101 1047121 1099192 1100210 1100246 52071 1018 36 53125 0
48 urn:0-1-44-0-1186429255720 192.5.198.57:50101 1047111 1097371 1100555 1100562 50260 3184 7 53451 0
49 urn:0-1-43-0-1186429255705 192.5.198.66:50100 1047109 1097135 1100896 1100904 50026 3761 8 53795 0
50 urn:0-1-48-0-1186429255737 192.5.198.71:50101 1047117 1098640 1101106 1101127 51523 2466 21 54010 0
51 urn:0-1-51-0-1186429255755 192.5.198.55:50100 1047123 1099965 1101217 1101224 52842 1252 7 54101 0
52 urn:0-1-47-0-1186429255731 192.5.198.71:50100 1047115 1098227 1101820 1101828 51112 3593 8 54713 0
53 urn:0-1-45-0-1186429255723 192.5.198.57:50100 1047113 1097375 1104132 1104139 50262 6757 7 57026 0
54 urn:0-1-52-0-1186429255764 192.5.198.67:50101 1047125 1100221 1106449 1106458 53096 6228 9 59333 0
55 urn:0-1-46-0-1186429255743 192.5.198.67:50100 1047119 1098916 1106473 1106481 51797 7557 8 59362 0
56 urn:0-1-2-1-1186428881026 192.5.198.70:50101 563313 563384 1207793 1207801 71 644409 8 644488 0
57 urn:0-1-1-1-1186428881028 192.5.198.70:50100 563315 563413 1216404 1216425 98 652991 21 653110 0
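The taskPerf records above can be tallied mechanically to answer how many jobs Falkon ran and how they exited. What follows is a minimal sketch, not part of Falkon: the names `TaskRecord`, `parse_line`, `check_derived` and `summarize` are hypothetical, and the parser assumes only the 12 whitespace-separated columns named in the `//` header line. It also checks that each duration column equals the difference of two timestamp columns, which holds for the rows listed here (for task 1, waitQueueTime 49780 = 560276 - 510496).

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class TaskRecord:
    """One row of GenericPortalWS_taskPerf.txt (columns as named in its // header)."""
    task_num: int
    task_id: str
    worker_id: str
    start_ts: int            # startTimeStamp
    exec_ts: int             # execTimeStamp
    results_queue_ts: int    # resultsQueueTimeStamp
    end_ts: int              # endTimeStamp
    wait_queue_time: int
    exec_time: int
    results_queue_time: int
    total_time: int
    exit_code: int


def parse_line(line: str) -> Optional[TaskRecord]:
    """Parse one record; return None for blank lines and the // header."""
    line = line.strip()
    if not line or line.startswith("//"):
        return None
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}: {line!r}")
    # Columns 4 through 12 are all integers, in header order.
    return TaskRecord(int(fields[0]), fields[1], fields[2], *map(int, fields[3:]))


def check_derived(r: TaskRecord) -> bool:
    """True if each duration column equals the difference of its two timestamps."""
    return (
        r.wait_queue_time == r.exec_ts - r.start_ts
        and r.exec_time == r.results_queue_ts - r.exec_ts
        and r.results_queue_time == r.end_ts - r.results_queue_ts
        and r.total_time == r.end_ts - r.start_ts
    )


def summarize(lines: Iterable[str]) -> Tuple[int, int, bool]:
    """Return (record count, records with exit code 0, all durations consistent)."""
    records = [r for r in map(parse_line, lines) if r is not None]
    succeeded = sum(1 for r in records if r.exit_code == 0)
    consistent = all(check_derived(r) for r in records)
    return len(records), succeeded, consistent
```

Run over the full listing, the exit-code-0 count can be compared directly against the number of jobs Swift believes it submitted.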



Veronika Nefedova wrote:
> OK. There is something weird happening. I've got several such entries 
> in my swift log:
>
> 2007-08-06 14:46:58,565 DEBUG vdl:execute2 Application exception: Task failed
>         task:execute @ vdl-int.k, line: 332
>         vdl:execute2 @ execute-default.k, line: 22
>         vdl:execute @ MolDyn-244-loops.kml, line: 20
>         antchmbr @ MolDyn-244-loops.kml, line: 2845
>         vdl:mains @ MolDyn-244-loops.kml, line: 2267
>
>
> It looks like antechamber has failed (?). The failure is only on the
> Swift side; it never made it across to Falkon (no remote directories
> were created). But I see that some of the antechamber jobs have finished
> (in shared).
>
> Yuqing, could the changes you've made be responsible for these
> failures (though I do not see how they could)?
>
> Ioan, what do you see in your logs on these tasks?
>
> 2007-08-06 14:46:58,555 DEBUG TaskImpl Task(type=1, identity=urn:0-1-56-0-1186429255786) setting status to Failed
> 2007-08-06 14:46:58,556 DEBUG TaskImpl Task(type=1, identity=urn:0-1-57-0-1186429255798) setting status to Failed
> 2007-08-06 14:46:58,558 DEBUG TaskImpl Task(type=1, identity=urn:0-1-59-0-1186429255800) setting status to Failed
> 2007-08-06 14:46:58,558 DEBUG TaskImpl Task(type=1, identity=urn:0-1-60-0-1186429255805) setting status to Failed
> 2007-08-06 14:46:58,558 DEBUG TaskImpl Task(type=1, identity=urn:0-1-61-0-1186429255811) setting status to Failed
> 2007-08-06 14:46:58,558 DEBUG TaskImpl Task(type=1, identity=urn:0-1-58-0-1186429255814) setting status to Failed
>
> Nika
>
> On Aug 6, 2007, at 2:29 PM, Ioan Raicu wrote:
>
>> OK!
>> Why don't we do one last run from my allocation, as everything is set 
>> up already and ready to go!  Make sure to enable all debug logging.  
>> Falkon is up and running with all debug enabled!
>>
>> Falkon location is unchanged from the last experiment.
>> Falkon Factory Service: 
>> http://tg-viz-login2:50010/wsrf/services/GenericPortal/core/WS/GPFactoryService 
>>
>> Web Server (graphs): 
>> http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>
>> ANL/UC is not quite as idle as it was earlier, but I bet we could
>> still get 150 to 200 processors!
>>
>> Ioan
>>
>> Veronika Nefedova wrote:
>>> m050 and m179 have now finished just fine via GRAM (thanks to Yuqing,
>>> who fixed m179 just in time!). We could restart the 244-molecule run
>>> to verify that nothing is wrong with the whole system.
>>>
>>> Nika
>>>
>>> On Aug 6, 2007, at 12:20 PM, Veronika Nefedova wrote:
>>>
>>>>
>>>> On Aug 6, 2007, at 11:51 AM, Ioan Raicu wrote:
>>>>
>>>>
>>>> I started those 2 molecules via GRAM. I don't expect m179 to finish
>>>> completely, since I didn't change anything, but I hope m050 will
>>>> finish...
>>>> You can watch the swift log on viper in 
>>>> ~nefedova/alamines/MolDyn-2-loops-be9484k93kk21.log
>>>>
>>>> Nika
>>>>
>>>>> Then, let's try another run with 244 molecules soon, as most of 
>>>>> ANL/UC is free!
>>>>>
>>>>> Ioan
>>>>>
>>>
>>>
>>
>
>
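Nika's question, whether the tasks Swift marked as Failed ever reached Falkon, can be checked by comparing task identities across the two logs. The sketch below is hypothetical: the helper names are illustrative, and it assumes, without confirmation from either codebase, that the `identity=urn:...` string in Swift's TaskImpl debug lines is the same string Falkon records in the taskID column of its taskPerf log, and that each log entry sits on a single line (the archive wraps them).

```python
import re
from typing import Iterable, List, Set

# Hypothetical pattern, written against the TaskImpl excerpt quoted above, e.g.
#   ... Task(type=1, identity=urn:0-1-56-0-1186429255786) setting status to Failed
FAILED_RE = re.compile(r"identity=(urn:[\w-]+)\) setting status to Failed")


def failed_in_swift(log_text: str) -> Set[str]:
    """Task identities that the Swift log marks as Failed."""
    return set(FAILED_RE.findall(log_text))


def seen_by_falkon(task_perf_lines: Iterable[str]) -> Set[str]:
    """taskID column (2nd field) of every record in Falkon's taskPerf log."""
    ids = set()
    for line in task_perf_lines:
        fields = line.split()
        # Skip blank lines and the "//taskNum ..." header.
        if len(fields) >= 2 and not fields[0].startswith("//"):
            ids.add(fields[1])
    return ids


def never_reached_falkon(log_text: str, task_perf_lines: Iterable[str]) -> List[str]:
    """Failed Swift tasks with no matching Falkon taskID (assumes the IDs match 1:1)."""
    return sorted(failed_in_swift(log_text) - seen_by_falkon(task_perf_lines))
```

In the 57-row listing Ioan posted, none of the urn:0-1-56-0 through urn:0-1-61-0 identities appear among Falkon's taskIDs, which is consistent with Nika's reading that these tasks never made it across.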


