[Swift-devel] misassignment of jobs

Allan Espinosa aespinosa at cs.uchicago.edu
Thu Nov 18 18:15:03 CST 2010


2010/11/18 Mihael Hategan <hategan at mcs.anl.gov>:
> Also, can you post sites.xml and the full log?
>
> On Thu, 2010-11-18 at 16:08 -0600, Allan Espinosa wrote:
>> I'm using a file named tc.data:
>>
>> 2010-11-17 15:38:50,115-0600 INFO  unknown Using tc.data: tc.data
>> $ cat tc.data
>> PADS  sleep_pads  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> BNL-ATLAS_gridgk01.racf.bnl.gov  worker0  /usatlas/OSG/engage-scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> BNL-ATLAS_gridgk01.racf.bnl.gov  sleep0  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> BNL-ATLAS_gridgk01.racf.bnl.gov  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> BNL-ATLAS_gridgk02.racf.bnl.gov  worker1  /usatlas/OSG/engage-scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> BNL-ATLAS_gridgk02.racf.bnl.gov  sleep1  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> BNL-ATLAS_gridgk02.racf.bnl.gov  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> FNAL_FERMIGRID_fnpcosg1.fnal.gov  worker2  /grid/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> FNAL_FERMIGRID_fnpcosg1.fnal.gov  sleep2  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> FNAL_FERMIGRID_fnpcosg1.fnal.gov  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> Firefly_ff-grid3.unl.edu  worker3  /panfs/panasas/CMS/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> Firefly_ff-grid3.unl.edu  sleep3  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> Firefly_ff-grid3.unl.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> GridUNESP_CENTRAL_ce.grid.unesp.br  worker4  /osg/app/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> GridUNESP_CENTRAL_ce.grid.unesp.br  sleep4  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> GridUNESP_CENTRAL_ce.grid.unesp.br  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  worker5  /opt/osg/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  sleep5  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> MIT_CMS_ce01.cmsaf.mit.edu  worker6  /osg/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> MIT_CMS_ce01.cmsaf.mit.edu  sleep6  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> MIT_CMS_ce01.cmsaf.mit.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> MIT_CMS_ce02.cmsaf.mit.edu  worker7  /osg/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> MIT_CMS_ce02.cmsaf.mit.edu  sleep7  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> MIT_CMS_ce02.cmsaf.mit.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  worker8  /grid-tmp/grid-apps/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  sleep8  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> Nebraska_gpn-husker.unl.edu  worker9  /opt/osg/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> Nebraska_gpn-husker.unl.edu  sleep9  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> Nebraska_gpn-husker.unl.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> Nebraska_red.unl.edu  worker10  /opt/osg/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> Nebraska_red.unl.edu  sleep10  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> Nebraska_red.unl.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> Prairiefire_pf-grid.unl.edu  worker11  /opt/pfgridapp/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> Prairiefire_pf-grid.unl.edu  sleep11  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> Prairiefire_pf-grid.unl.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> Purdue-RCAC_osg.rcac.purdue.edu  worker12  /apps/osg/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> Purdue-RCAC_osg.rcac.purdue.edu  sleep12  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> Purdue-RCAC_osg.rcac.purdue.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> RENCI-Engagement_belhaven-1.renci.org  worker13  /nfs/osg-app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> RENCI-Engagement_belhaven-1.renci.org  sleep13  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> RENCI-Engagement_belhaven-1.renci.org  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> SBGrid-Harvard-East_osg-east.hms.harvard.edu  worker14  /osg/storage/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> SBGrid-Harvard-East_osg-east.hms.harvard.edu  sleep14  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> SBGrid-Harvard-East_osg-east.hms.harvard.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> SPRACE_osg-ce.sprace.org.br  worker15  /osg/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> SPRACE_osg-ce.sprace.org.br  sleep15  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> SPRACE_osg-ce.sprace.org.br  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> UCHC_CBG_vdgateway.vcell.uchc.edu  worker16  /osg/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> UCHC_CBG_vdgateway.vcell.uchc.edu  sleep16  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> UCHC_CBG_vdgateway.vcell.uchc.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> UCR-HEP_top.ucr.edu  worker17  /data/bottom/osg_app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> UCR-HEP_top.ucr.edu  sleep17  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> UCR-HEP_top.ucr.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> UFlorida-HPC_osg.hpc.ufl.edu  worker18  /osg/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> UFlorida-HPC_osg.hpc.ufl.edu  sleep18  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> UFlorida-HPC_osg.hpc.ufl.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> UFlorida-PG_pg.ihepa.ufl.edu  worker19  /raid/osgpg/pg/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> UFlorida-PG_pg.ihepa.ufl.edu  sleep19  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> UFlorida-PG_pg.ihepa.ufl.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> UMissHEP_umiss001.hep.olemiss.edu  worker20  /osgremote/osg_app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> UMissHEP_umiss001.hep.olemiss.edu  sleep20  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> UMissHEP_umiss001.hep.olemiss.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> UTA_SWT2_gk04.swt2.uta.edu  worker21  /cluster/grid/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> UTA_SWT2_gk04.swt2.uta.edu  sleep21  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> UTA_SWT2_gk04.swt2.uta.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> WQCG-Harvard-OSG_tuscany.med.harvard.edu  worker22  /osg/storage/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> WQCG-Harvard-OSG_tuscany.med.harvard.edu  sleep22  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>> WQCG-Harvard-OSG_tuscany.med.harvard.edu  sleep  /bin/sleep  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="00:05:00"
>>
>> 2010/11/18 Mihael Hategan <hategan at mcs.anl.gov>:
>> > I'm sure there is a reasonable explanation for this.
>> >
>> > Can you post your entire tc.data? And to make sure we're talking about
>> > the right one, can you look at the Swift log and post exactly the file
>> > that Swift says it is using?
>> >
>> > Mihael
>> >
>> > On Thu, 2010-11-18 at 14:39 -0600, Allan Espinosa wrote:
>> >> tc.data for worker15:
>> >> SPRACE_osg-ce.sprace.org.br  worker15  /osg/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
>> >>
>> >> But it was assigned to another site instead:
>> >> $ grep 0erqqq1k worker-*.log
>> >> 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION
>> >> jobid=worker15-0erqqq1k thread
>> >>  host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k
>> >> 2010-11-17 15:38:59,110-0600 INFO  vdl:createdirset START
>> >> jobid=worker15-0erqqq1k host=LIGO_UWM_N
>> >> ce.phys.uwm.edu - Initializing directory structure
>> >> 2010-11-17 15:38:59,137-0600 INFO  vdl:createdirset END
>> >> jobid=worker15-0erqqq1k - Done initializi
>> >> structure
>> >> 2010-11-17 15:38:59,172-0600 INFO  vdl:dostagein START
>> >> jobid=worker15-0erqqq1k - Staging in files
>> >> 2010-11-17 15:38:59,257-0600 INFO  vdl:dostagein END
>> >> jobid=worker15-0erqqq1k - Staging in finishe
>> >> 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START
>> >> jobid=worker15-0erqqq1k tr=worker15 arg
>> >> //128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200]
>> >> tmpdir=worker-20101117-1538-fe9a
>> >> orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
>> >> 2010-11-17 15:39:01,394-0600 INFO  Execute Submit: in:
>> >> worker-20101117-1538-fe9aq209 command: /bi
>> >> /_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch  -e worker15 -out
>> >> stdout.txt -err stderr.txt -i
>> >>  -k  -cdmfile  -status files -a http://128.135.125.17:61015
>> >> SPRACE_osg-ce.sprace.org.br /tmp 7200
>> >> 2010-11-17 15:39:01,394-0600 INFO  GridExec TASK_DEFINITION:
>> >> Task(type=JOB_SUBMISSION, identity=u
>> >> -1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k
>> >> -jobdir 0 -scratch  -e worker1
>> >> .txt -err stderr.txt -i -d  -if  -of  -k  -cdmfile  -status files -a
>> >> http://128.135.125.17:61015
>> >> .sprace.org.br /tmp 7200
>> >> 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START
>> >> jobid=worker15-0erqqq1k
>> >> 2010-11-17 16:49:33,278-0600 INFO  vdl:checkjobstatus FAILURE
>> >> jobid=worker15-0erqqq1k - Failure f
>> >> 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION
>> >> jobid=worker15-0erqqq1k - A
>> >> ception: Cannot find executable worker15 on site system path
>> >>
>> >> There is no worker15 entry for the site LIGO_UWM_NEMO in my tc.data.
>> >>
>> >> -Allan

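For anyone who wants to cross-check the quoted tc.data against the log
without eyeballing it, here is a minimal sketch in Python. It is not part
of Swift. The helper names (parse_tc, sites_for, misassigned) are made up
for illustration. It assumes one tc.data entry per line with
whitespace-separated fields (site, transformation, path, ...), that lines
starting with "#" are comments, and that a job id is the transformation
name followed by "-" and a unique suffix, as in worker15-0erqqq1k. Only
the jobid= and host= fields of the THREAD_ASSOCIATION line are used.

```python
import re


def parse_tc(text):
    """Map (site, transformation) -> executable path.

    Assumes one entry per line, whitespace-separated, with the site,
    transformation and path as the first three fields.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: '#' starts a comment
            continue
        fields = line.split(None, 5)
        if len(fields) < 3:
            continue  # not a complete entry, skip it
        site, tr, path = fields[:3]
        entries[(site, tr)] = path
    return entries


def sites_for(entries, tr):
    """Sorted list of sites that define the transformation `tr`."""
    return sorted(site for (site, name) in entries if name == tr)


# Only jobid= and host= are matched; other fields on the line are ignored.
ASSOC = re.compile(r"THREAD_ASSOCIATION jobid=(\S+).*?host=(\S+)")


def misassigned(entries, log_text):
    """Yield (jobid, host) pairs whose host has no entry for the job's
    transformation. Assumes jobid == "<transformation>-<suffix>"."""
    for match in ASSOC.finditer(log_text):
        jobid, host = match.groups()
        tr = jobid.rsplit("-", 1)[0]
        if (host, tr) not in entries:
            yield jobid, host


# Sample data copied from the quoted thread. The log line is shown on one
# line and without the thread field, which is cut off in the quoted output.
TC = """\
SPRACE_osg-ce.sprace.org.br  worker15  /osg/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu  worker5  /opt/osg/app/engage/scec/worker.pl  INSTALLED  INTEL32::LINUX  GLOBUS::maxwalltime="02:00:00"
"""
LOG = (
    "2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION "
    "jobid=worker15-0erqqq1k "
    "host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k"
)

if __name__ == "__main__":
    tc = parse_tc(TC)
    print("worker15 is defined on:", sites_for(tc, "worker15"))
    for jobid, host in misassigned(tc, LOG):
        print("no tc.data entry for", jobid, "on", host)
```

Pointing parse_tc at the real tc.data and misassigned at the uncompressed
worker-*.log should list every job that was sent to a site with no
matching entry, which is the symptom described in the quoted thread.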


-- 
Allan M. Espinosa <http://amespinosa.wordpress.com>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: condor_osg.xml
Type: text/xml
Size: 12555 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101118/b205a759/attachment.xml>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: worker-20101117-1538-fe9aq209.log.bz2
Type: application/x-bzip2
Size: 1584471 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101118/b205a759/attachment.bin>

