[Swift-devel] Q about MolDyn
Veronika Nefedova
nefedova at mcs.anl.gov
Mon Aug 6 15:44:16 CDT 2007
Swift thinks that it sent 248 jobs.
nefedova at viper:~/alamines> grep "Running job " MolDyn-244-loops-dbui34oxjr4j2.log | wc
248 6931 56718
nefedova at viper:~/alamines>
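
The same count can be reproduced without reading the wc columns by hand. This is a minimal sketch in Python, assuming only that each submission is marked by the literal string "Running job " (the string the grep above matches). The function names and the sample lines are made up for illustration and are not real Swift log output.

```python
def count_marker(lines, marker="Running job "):
    """Count lines containing marker, like `grep marker | wc -l`."""
    return sum(1 for line in lines if marker in line)


def unaccounted(submitted, completed):
    """Jobs the submitter counted that the executor never reported done."""
    return submitted - completed


# Synthetic stand-in lines, NOT real Swift log output.
sample = [
    "t0 Running job a",
    "t1 something else",
    "t2 Running job b",
]
print(count_marker(sample))

# Figures from this thread: 248 jobs counted by Swift, 57 completed in Falkon.
print(unaccounted(248, 57))
```

For a real log, an open file object can be passed as `lines`, since iterating over a file yields one line at a time. With the numbers in this thread, 191 of the jobs Swift counted never show up as completed on the Falkon side.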
On Aug 6, 2007, at 3:27 PM, Ioan Raicu wrote:
> Everything is idle, there is no work to be done...
>
> iraicu at tg-viz-login2:~/java/Falkon_v0.8.1/service/logs> tail GenericPortalWS_perf_per_sec.txt
> 3510.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
> 3511.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
> 3512.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
> 3513.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
> 3514.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
> 3515.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
> 3516.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
> 3517.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
> 3518.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
> 3519.997 0 2 41 24 24 0 0 0.0 0.0 0.0 0.0 57.0 0.0
>
> 24 workers are registered but idle; the queue length is 0 and 57
> jobs have completed.
>
> Also, see below all 57 jobs; they all finished with an exit code of
> 0, in other words successfully! How many jobs does Swift think it
> sent?
>
> Ioan
>
> iraicu at tg-viz-login2:~/java/Falkon_v0.8.1/service/logs> cat
> GenericPortalWS_taskPerf.txt
> //taskNum taskID workerID startTimeStamp execTimeStamp
> resultsQueueTimeStamp endTimeStamp waitQueueTime execTime
> resultsQueueTime totalTime exitCode
> 1 urn:0-0-1186428880921 192.5.198.70:50100 510496 560276 560614
> 560629 49780 338 15 50133 0
> 2 urn:0-1-1-0-1186428880939 192.5.198.70:50101 560984 561200 561899
> 561909 216 699 10 925 0
> 3 urn:0-1-2-0-1186428880941 192.5.198.70:50100 560991 561373 562150
> 562159 382 777 9 1168 0
> 4 urn:0-0-1186429254652 192.5.198.71:50100 972312 1034716 1044916
> 1044926 62404 10200 10 72614 0
> 5 urn:0-1-2-0-1186429255467 192.5.198.71:50101 1046318 1046453
> 1047038 1047067 135 585 29 749 0
> 6 urn:0-1-1-0-1186429255461 192.5.198.71:50100 1046315 1046429
> 1053072 1053080 114 6643 8 6765 0
> 7 urn:0-1-3-0-1186429255469 192.5.198.71:50101 1046320 1047051
> 1054256 1054290 731 7205 34 7970 0
> 8 urn:0-1-5-0-1186429255481 192.5.198.71:50101 1046324 1054267
> 1054570 1054579 7943 303 9 8255 0
> 9 urn:0-1-4-0-1186429255479 192.5.198.71:50100 1046322 1053087
> 1056811 1056819 6765 3724 8 10497 0
> 10 urn:0-1-6-0-1186429255484 192.5.198.71:50101 1046326 1054583
> 1058691 1058719 8257 4108 28 12393 0
> 11 urn:0-1-8-0-1186429255495 192.5.198.71:50101 1046331 1058704
> 1059363 1059385 12373 659 22 13054 0
> 12 urn:0-1-7-0-1186429255486 192.5.198.71:50100 1046329 1056826
> 1060315 1060323 10497 3489 8 13994 0
> 13 urn:0-1-9-0-1186429255502 192.5.198.71:50101 1046333 1059375
> 1060589 1060596 13042 1214 7 14263 0
> 14 urn:0-1-11-0-1186429255514 192.5.198.71:50101 1046338 1060603
> 1060954 1061054 14265 351 100 14716 0
> 15 urn:0-1-10-0-1186429255511 192.5.198.71:50100 1046336 1060329
> 1061094 1061126 13993 765 32 14790 0
> 16 urn:0-1-14-0-1186429255533 192.5.198.71:50100 1046691 1061105
> 1065608 1065617 14414 4503 9 18926 0
> 17 urn:0-1-13-0-1186429255535 192.5.198.71:50100 1046693 1065622
> 1066307 1066315 18929 685 8 19622 0
> 18 urn:0-1-12-0-1186429255524 192.5.198.71:50101 1046689 1061045
> 1067540 1067563 14356 6495 23 20874 0
> 19 urn:0-1-15-0-1186429255539 192.5.198.71:50100 1046695 1066320
> 1069262 1069271 19625 2942 9 22576 0
> 20 urn:0-1-16-0-1186429255543 192.5.198.71:50101 1046697 1067551
> 1071003 1071011 20854 3452 8 24314 0
> 21 urn:0-1-18-0-1186429255559 192.5.198.71:50101 1046700 1071016
> 1071664 1071671 24316 648 7 24971 0
> 22 urn:0-1-17-0-1186429255557 192.5.198.71:50100 1046698 1069275
> 1071679 1071692 22577 2404 13 24994 0
> 23 urn:0-1-19-0-1186429255565 192.5.198.71:50101 1046702 1071687
> 1073978 1073988 24985 2291 10 27286 0
> 24 urn:0-1-20-0-1186429255572 192.5.198.71:50101 1046706 1073992
> 1075959 1075969 27286 1967 10 29263 0
> 25 urn:0-1-21-0-1186429255567 192.5.198.71:50100 1046704 1071699
> 1076704 1076713 24995 5005 9 30009 0
> 26 urn:0-1-22-0-1186429255587 192.5.198.71:50101 1046708 1075972
> 1077451 1077459 29264 1479 8 30751 0
> 27 urn:0-1-23-0-1186429255595 192.5.198.71:50100 1046710 1076717
> 1080157 1080165 30007 3440 8 33455 0
> 28 urn:0-1-25-0-1186429255599 192.5.198.71:50101 1046712 1077464
> 1080270 1080286 30752 2806 16 33574 0
> 29 urn:0-1-24-0-1186429255601 192.5.198.71:50100 1046713 1080170
> 1080611 1080619 33457 441 8 33906 0
> 30 urn:0-1-26-0-1186429255613 192.5.198.71:50100 1046717 1080624
> 1080973 1080983 33907 349 10 34266 0
> 31 urn:0-1-28-0-1186429255611 192.5.198.71:50101 1046715 1080281
> 1081405 1081413 33566 1124 8 34698 0
> 32 urn:0-1-27-0-1186429255616 192.5.198.71:50100 1046719 1080986
> 1082989 1082996 34267 2003 7 36277 0
> 33 urn:0-1-30-0-1186429255635 192.5.198.71:50100 1046723 1083002
> 1083370 1083378 36279 368 8 36655 0
> 34 urn:0-1-29-0-1186429255622 192.5.198.71:50101 1046721 1081417
> 1084830 1084837 34696 3413 7 38116 0
> 35 urn:0-1-32-0-1186429255652 192.5.198.71:50101 1047082 1084843
> 1085854 1085879 37761 1011 25 38797 0
> 36 urn:0-1-34-0-1186429255654 192.5.198.71:50101 1047085 1085865
> 1089502 1089511 38780 3637 9 42426 0
> 37 urn:0-1-33-0-1186429255656 192.5.198.71:50101 1047087 1089515
> 1089966 1089974 42428 451 8 42887 0
> 38 urn:0-1-31-0-1186429255642 192.5.198.71:50100 1046725 1083383
> 1091316 1091324 36658 7933 8 44599 0
> 39 urn:0-1-36-0-1186429255664 192.5.198.71:50100 1047092 1091329
> 1092042 1092049 44237 713 7 44957 0
> 40 urn:0-1-38-0-1186429255673 192.5.198.71:50100 1047095 1092055
> 1094242 1094249 44960 2187 7 47154 0
> 41 urn:0-1-35-0-1186429255658 192.5.198.71:50101 1047090 1089979
> 1094418 1094428 42889 4439 10 47338 0
> 42 urn:0-1-40-0-1186429255696 192.5.198.71:50101 1047102 1094433
> 1095082 1095089 47331 649 7 47987 0
> 43 urn:0-1-41-0-1186429255692 192.5.198.71:50101 1047104 1095095
> 1096846 1096853 47991 1751 7 49749 0
> 44 urn:0-1-39-0-1186429255686 192.5.198.71:50100 1047100 1094256
> 1098214 1098221 47156 3958 7 51121 0
> 45 urn:0-1-42-0-1186429255700 192.5.198.71:50101 1047107 1096859
> 1098627 1098637 49752 1768 10 51530 0
> 46 urn:0-1-37-0-1186429255681 192.5.198.67:50100 1047097 1094037
> 1098903 1098910 46940 4866 7 51813 0
> 47 urn:0-1-50-0-1186429255749 192.5.198.67:50101 1047121 1099192
> 1100210 1100246 52071 1018 36 53125 0
> 48 urn:0-1-44-0-1186429255720 192.5.198.57:50101 1047111 1097371
> 1100555 1100562 50260 3184 7 53451 0
> 49 urn:0-1-43-0-1186429255705 192.5.198.66:50100 1047109 1097135
> 1100896 1100904 50026 3761 8 53795 0
> 50 urn:0-1-48-0-1186429255737 192.5.198.71:50101 1047117 1098640
> 1101106 1101127 51523 2466 21 54010 0
> 51 urn:0-1-51-0-1186429255755 192.5.198.55:50100 1047123 1099965
> 1101217 1101224 52842 1252 7 54101 0
> 52 urn:0-1-47-0-1186429255731 192.5.198.71:50100 1047115 1098227
> 1101820 1101828 51112 3593 8 54713 0
> 53 urn:0-1-45-0-1186429255723 192.5.198.57:50100 1047113 1097375
> 1104132 1104139 50262 6757 7 57026 0
> 54 urn:0-1-52-0-1186429255764 192.5.198.67:50101 1047125 1100221
> 1106449 1106458 53096 6228 9 59333 0
> 55 urn:0-1-46-0-1186429255743 192.5.198.67:50100 1047119 1098916
> 1106473 1106481 51797 7557 8 59362 0
> 56 urn:0-1-2-1-1186428881026 192.5.198.70:50101 563313 563384
> 1207793 1207801 71 644409 8 644488 0
> 57 urn:0-1-1-1-1186428881028 192.5.198.70:50100 563315 563413
> 1216404 1216425 98 652991 21 653110 0
>
>
>
> Veronika Nefedova wrote:
>> OK. There is something weird happening. I've got several such
>> entries in my swift log:
>>
>> 2007-08-06 14:46:58,565 DEBUG vdl:execute2 Application exception:
>> Task failed
>> task:execute @ vdl-int.k, line: 332
>> vdl:execute2 @ execute-default.k, line: 22
>> vdl:execute @ MolDyn-244-loops.kml, line: 20
>> antchmbr @ MolDyn-244-loops.kml, line: 2845
>> vdl:mains @ MolDyn-244-loops.kml, line: 2267
>>
>>
>> Looks like antechamber has failed (?). And the failure is only on
>> the Swift side; it never made it across to Falkon (there are no
>> remote directories created). But I see that some of the antechamber
>> jobs have finished (in shared).
>>
>> Yuqing -- could the changes you've made be responsible for these
>> failures (I do not see how they could, though)?
>>
>> Ioan, what do you see in your logs for these tasks:
>>
>> 2007-08-06 14:46:58,555 DEBUG TaskImpl Task(type=1, identity=urn:
>> 0-1-56-0-1186429255786) setting status to Failed
>> 2007-08-06 14:46:58,556 DEBUG TaskImpl Task(type=1, identity=urn:
>> 0-1-57-0-1186429255798) setting status to Failed
>> 2007-08-06 14:46:58,558 DEBUG TaskImpl Task(type=1, identity=urn:
>> 0-1-59-0-1186429255800) setting status to Failed
>> 2007-08-06 14:46:58,558 DEBUG TaskImpl Task(type=1, identity=urn:
>> 0-1-60-0-1186429255805) setting status to Failed
>> 2007-08-06 14:46:58,558 DEBUG TaskImpl Task(type=1, identity=urn:
>> 0-1-61-0-1186429255811) setting status to Failed
>> 2007-08-06 14:46:58,558 DEBUG TaskImpl Task(type=1, identity=urn:
>> 0-1-58-0-1186429255814) setting status to Failed
>>
>> Nika
>>
>> On Aug 6, 2007, at 2:29 PM, Ioan Raicu wrote:
>>
>>> OK!
>>> Why don't we do one last run from my allocation, as everything is
>>> set up already and ready to go! Make sure to enable all debug
>>> logging. Falkon is up and running with all debug enabled!
>>>
>>> Falkon location is unchanged from the last experiment.
>>> Falkon Factory Service: http://tg-viz-login2:50010/wsrf/services/GenericPortal/core/WS/GPFactoryService
>>> Web Server (graphs): http://tg-viz-login2.uc.teragrid.org:51000/index.htm
>>>
>>> ANL/UC is not quite so idle as it was earlier, but I bet we could
>>> still get 150~200 processors!
>>>
>>> Ioan
>>>
>>> Veronika Nefedova wrote:
>>>> m050 and m179 finished just fine now via GRAM (thanks to Yuqing,
>>>> who fixed m179 just in time!). We could start the 244-molecule
>>>> run again to verify that nothing is wrong with the whole system.
>>>>
>>>> Nika
>>>>
>>>> On Aug 6, 2007, at 12:20 PM, Veronika Nefedova wrote:
>>>>
>>>>>
>>>>> On Aug 6, 2007, at 11:51 AM, Ioan Raicu wrote:
>>>>>
>>>>>
>>>>> I started those 2 molecules via GRAM. I have no trust in m179
>>>>> finishing completely since I didn't change anything. I hope for
>>>>> m050 to finish though...
>>>>> You can watch the swift log on viper in
>>>>> ~nefedova/alamines/MolDyn-2-loops-be9484k93kk21.log
>>>>>
>>>>> Nika
>>>>>
>>>>>> Then, let's try another run with 244 molecules soon, as most
>>>>>> of ANL/UC is free!
>>>>>>
>>>>>> Ioan
>>>>>>
>>>>
>>>>
>>>
>>
>>
>
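
The cross-check discussed in this thread (which of the tasks that Swift marked Failed ever appear in Falkon's GenericPortalWS_taskPerf.txt) can be sketched in Python. This is only a sketch under stated assumptions: the log lines are read unwrapped (one entry per line, not folded by the mail client as above), the Swift entry has the "identity=urn:... setting status to Failed" shape quoted earlier, and the taskPerf file is whitespace-separated with the task ID in the second column and "//" starting the header line. The function names are hypothetical.

```python
import re

# Matches the TaskImpl entries quoted in this thread (unwrapped form).
FAILED_RE = re.compile(r"identity=(urn:[0-9-]+)\)\s+setting status to Failed")


def failed_task_ids(swift_lines):
    """Task IDs that the Swift log reports as set to Failed, in log order."""
    ids = []
    for line in swift_lines:
        match = FAILED_RE.search(line)
        if match:
            ids.append(match.group(1))
    return ids


def falkon_task_ids(taskperf_lines):
    """Task IDs from taskPerf rows (second column), skipping '//' headers."""
    ids = set()
    for line in taskperf_lines:
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("//"):
            continue
        ids.add(fields[1])
    return ids


def not_seen_by_falkon(swift_lines, taskperf_lines):
    """Failed Swift tasks that never appear in the Falkon task table."""
    seen = falkon_task_ids(taskperf_lines)
    return [tid for tid in failed_task_ids(swift_lines) if tid not in seen]


# Two entries copied from this thread, with the mail wrapping undone.
swift_log = [
    "2007-08-06 14:46:58,555 DEBUG TaskImpl Task(type=1, "
    "identity=urn:0-1-56-0-1186429255786) setting status to Failed",
]
task_perf = [
    "//taskNum taskID workerID startTimeStamp execTimeStamp",
    "1 urn:0-0-1186428880921 192.5.198.70:50100 510496 560276 560614 "
    "560629 49780 338 15 50133 0",
]
print(not_seen_by_falkon(swift_log, task_perf))
```

Any ID this returns was marked Failed by Swift without ever being recorded by Falkon, which would be consistent with the failure happening on the Swift side before submission.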