[Swift-devel] misassignment of jobs
Mihael Hategan
hategan at mcs.anl.gov
Thu Nov 18 17:39:30 CST 2010
Ok. I can see a couple of code paths that can lead to this, but I need
to constrain it some more.
Does this happen every time you run this?
Mihael
On Thu, 2010-11-18 at 16:08 -0600, Allan Espinosa wrote:
> I'm using a file named tc.data
>
> 2010-11-17 15:38:50,115-0600 INFO unknown Using tc.data: tc.data
> $ cat tc.data
> PADS sleep_pads /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> BNL-ATLAS_gridgk01.racf.bnl.gov worker0 /usatlas/OSG/engage-scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> BNL-ATLAS_gridgk01.racf.bnl.gov sleep0 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> BNL-ATLAS_gridgk01.racf.bnl.gov sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> BNL-ATLAS_gridgk02.racf.bnl.gov worker1 /usatlas/OSG/engage-scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> BNL-ATLAS_gridgk02.racf.bnl.gov sleep1 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> BNL-ATLAS_gridgk02.racf.bnl.gov sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> FNAL_FERMIGRID_fnpcosg1.fnal.gov worker2 /grid/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> FNAL_FERMIGRID_fnpcosg1.fnal.gov sleep2 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> FNAL_FERMIGRID_fnpcosg1.fnal.gov sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> Firefly_ff-grid3.unl.edu worker3 /panfs/panasas/CMS/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> Firefly_ff-grid3.unl.edu sleep3 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Firefly_ff-grid3.unl.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> GridUNESP_CENTRAL_ce.grid.unesp.br worker4 /osg/app/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> GridUNESP_CENTRAL_ce.grid.unesp.br sleep4 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> GridUNESP_CENTRAL_ce.grid.unesp.br sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu worker5 /opt/osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu sleep5 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> MIT_CMS_ce01.cmsaf.mit.edu worker6 /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> MIT_CMS_ce01.cmsaf.mit.edu sleep6 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> MIT_CMS_ce01.cmsaf.mit.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> MIT_CMS_ce02.cmsaf.mit.edu worker7 /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> MIT_CMS_ce02.cmsaf.mit.edu sleep7 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> MIT_CMS_ce02.cmsaf.mit.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu worker8 /grid-tmp/grid-apps/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu sleep8 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> NYSGRID_CORNELL_NYS1_nys1.cac.cornell.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> Nebraska_gpn-husker.unl.edu worker9 /opt/osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> Nebraska_gpn-husker.unl.edu sleep9 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Nebraska_gpn-husker.unl.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> Nebraska_red.unl.edu worker10 /opt/osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> Nebraska_red.unl.edu sleep10 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Nebraska_red.unl.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> Prairiefire_pf-grid.unl.edu worker11 /opt/pfgridapp/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> Prairiefire_pf-grid.unl.edu sleep11 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Prairiefire_pf-grid.unl.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> Purdue-RCAC_osg.rcac.purdue.edu worker12 /apps/osg/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> Purdue-RCAC_osg.rcac.purdue.edu sleep12 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> Purdue-RCAC_osg.rcac.purdue.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> RENCI-Engagement_belhaven-1.renci.org worker13 /nfs/osg-app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> RENCI-Engagement_belhaven-1.renci.org sleep13 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> RENCI-Engagement_belhaven-1.renci.org sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> SBGrid-Harvard-East_osg-east.hms.harvard.edu worker14 /osg/storage/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> SBGrid-Harvard-East_osg-east.hms.harvard.edu sleep14 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> SBGrid-Harvard-East_osg-east.hms.harvard.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> SPRACE_osg-ce.sprace.org.br worker15 /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> SPRACE_osg-ce.sprace.org.br sleep15 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> SPRACE_osg-ce.sprace.org.br sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> UCHC_CBG_vdgateway.vcell.uchc.edu worker16 /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> UCHC_CBG_vdgateway.vcell.uchc.edu sleep16 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UCHC_CBG_vdgateway.vcell.uchc.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> UCR-HEP_top.ucr.edu worker17 /data/bottom/osg_app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> UCR-HEP_top.ucr.edu sleep17 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UCR-HEP_top.ucr.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> UFlorida-HPC_osg.hpc.ufl.edu worker18 /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> UFlorida-HPC_osg.hpc.ufl.edu sleep18 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UFlorida-HPC_osg.hpc.ufl.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> UFlorida-PG_pg.ihepa.ufl.edu worker19 /raid/osgpg/pg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> UFlorida-PG_pg.ihepa.ufl.edu sleep19 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UFlorida-PG_pg.ihepa.ufl.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> UMissHEP_umiss001.hep.olemiss.edu worker20 /osgremote/osg_app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> UMissHEP_umiss001.hep.olemiss.edu sleep20 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UMissHEP_umiss001.hep.olemiss.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> UTA_SWT2_gk04.swt2.uta.edu worker21 /cluster/grid/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> UTA_SWT2_gk04.swt2.uta.edu sleep21 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> UTA_SWT2_gk04.swt2.uta.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
> WQCG-Harvard-OSG_tuscany.med.harvard.edu worker22 /osg/storage/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> WQCG-Harvard-OSG_tuscany.med.harvard.edu sleep22 /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
> WQCG-Harvard-OSG_tuscany.med.harvard.edu sleep /bin/sleep INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="00:05:00"
>
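A catalog this size is easiest to sanity-check mechanically: index it by (site, transformation) and query the pair in question. The sketch below is a minimal illustration in Python, assuming only the whitespace-separated tc.data columns shown above (site, transformation, executable, status, platform, profiles); it is not Swift's own catalog parser.

    from collections import defaultdict

    def load_tc(path="tc.data"):
        """Index tc.data as site -> {transformation: executable path}."""
        catalog = defaultdict(dict)
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 3 or fields[0].startswith("#"):
                    continue  # skip blank lines and comments
                site, tr, exe = fields[0], fields[1], fields[2]
                catalog[site][tr] = exe
        return catalog

    catalog = load_tc()
    # worker15 appears only under SPRACE in the listing above:
    print(catalog["SPRACE_osg-ce.sprace.org.br"].get("worker15"))
    print(catalog["LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu"].get("worker15"))

Run against the listing above, the first lookup prints /osg/app/engage/scec/worker.pl and the second prints None, which is what makes the assignment reported below surprising.
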
> 2010/11/18 Mihael Hategan <hategan at mcs.anl.gov>:
> > I'm sure there is a reasonable explanation for this.
> >
> > Can you post your entire tc.data? And to make sure we're talking about
> > the right one, can you look at the swift log and use exactly the one
> > that swift claims it is using?
> >
> > Mihael
> >
> > On Thu, 2010-11-18 at 14:39 -0600, Allan Espinosa wrote:
> >> tc.data for worker15:
> >> SPRACE_osg-ce.sprace.org.br worker15 /osg/app/engage/scec/worker.pl INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="02:00:00"
> >>
> >> But it was assigned to another site instead:
> >> $ grep 0erqqq1k worker-*.log
> >> 2010-11-17 15:38:58,804-0600 DEBUG vdl:execute2 THREAD_ASSOCIATION jobid=worker15-0erqqq1k thread[...] host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu replicationGroup=2pnqqq1k
> >> 2010-11-17 15:38:59,110-0600 INFO vdl:createdirset START jobid=worker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu - Initializing directory structure
> >> 2010-11-17 15:38:59,137-0600 INFO vdl:createdirset END jobid=worker15-0erqqq1k - Done initializing directory structure
> >> 2010-11-17 15:38:59,172-0600 INFO vdl:dostagein START jobid=worker15-0erqqq1k - Staging in files
> >> 2010-11-17 15:38:59,257-0600 INFO vdl:dostagein END jobid=worker15-0erqqq1k - Staging in finished
> >> 2010-11-17 15:38:59,323-0600 DEBUG vdl:execute2 JOB_START jobid=worker15-0erqqq1k tr=worker15 arguments=[http://128.135.125.17:61015, SPRACE_osg-ce.sprace.org.br, /tmp, 7200] tmpdir=worker-20101117-1538-fe9a[...]orker15-0erqqq1k host=LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu
> >> 2010-11-17 15:39:01,394-0600 INFO Execute Submit: in: worker-20101117-1538-fe9aq209 command: /bin/bash shared/_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch -e worker15 -out stdout.txt -err stderr.txt -i -d -if -of -k -cdmfile -status files -a http://128.135.125.17:61015 SPRACE_osg-ce.sprace.org.br /tmp 7200
> >> 2010-11-17 15:39:01,394-0600 INFO GridExec TASK_DEFINITION: Task(type=JOB_SUBMISSION, identity=u[...]-1-1290029938030) is /bin/bash shared/_swiftwrap worker15-0erqqq1k -jobdir 0 -scratch -e worker15 -out stdout.txt -err stderr.txt -i -d -if -of -k -cdmfile -status files -a http://128.135.125.17:61015 SPRACE_osg-ce.sprace.org.br /tmp 7200
> >> 2010-11-17 16:49:33,106-0600 DEBUG vdl:checkjobstatus START jobid=worker15-0erqqq1k
> >> 2010-11-17 16:49:33,278-0600 INFO vdl:checkjobstatus FAILURE jobid=worker15-0erqqq1k - Failure f[...]
> >> 2010-11-17 16:49:38,180-0600 DEBUG vdl:execute2 APPLICATION_EXCEPTION jobid=worker15-0erqqq1k - Application exception: Cannot find executable worker15 on site system path
> >>
> >> There is no entry for worker15 for the site LIGO_UWM_NEMO in my tc.data
> >>
> >> -Allan
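
Allan's grep generalizes: each THREAD_ASSOCIATION record names both the job and the host it was bound to, so an entire run can be audited against the catalog. The sketch below is a minimal illustration, not a Swift tool; it repeats the indexing helper so it runs on its own, assumes the transformation name is the jobid prefix before the final "-<suffix>" (as in worker15-0erqqq1k), and reads the same worker-*.log files as the grep above.

    import glob
    import re
    from collections import defaultdict

    def load_tc(path="tc.data"):
        # Index tc.data as site -> {transformation: executable path}.
        catalog = defaultdict(dict)
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 3 and not fields[0].startswith("#"):
                    catalog[fields[0]][fields[1]] = fields[2]
        return catalog

    # THREAD_ASSOCIATION records carry both the job id and the chosen host.
    ASSOC = re.compile(r"THREAD_ASSOCIATION jobid=(\S+).*host=(\S+)")

    catalog = load_tc()
    for logname in glob.glob("worker-*.log"):
        with open(logname) as log:
            for line in log:
                m = ASSOC.search(line)
                if m:
                    jobid, host = m.group(1), m.group(2)
                    tr = jobid.rsplit("-", 1)[0]  # worker15-0erqqq1k -> worker15
                    if tr not in catalog.get(host, {}):
                        print("misassigned: %s sent to %s, which has no %s entry"
                              % (jobid, host, tr))

On the run quoted above this flags worker15-0erqqq1k on LIGO_UWM_NEMO_osg-nemo-ce.phys.uwm.edu, the mismatch Allan reports.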