[Swift-devel] Getting PBS email for Falkon failures - no $HOME?

Ioan Raicu iraicu at cs.uchicago.edu
Wed Sep 5 16:37:43 CDT 2007


Hi Mike,
See below for some comments/answers:

Michael Wilde wrote:
> Ioan, I'm doing a run of 1000 angle jobs, with worker throttling 
> parameters as below.  I'm seeing about 20 Falkon jobs running with 
> about 50+ nodes total allocated among them (in various numbers from 
> about 1 to 5).
Are any jobs queued up?  Ideally, you want to run 1 job per worker, so 
if you have 50 nodes with 2 workers per node (as indicated in your 
config file), you should be seeing about 100 jobs running at any given 
time.  Could the 20 running jobs be an artifact of Swift not sending 
jobs fast enough?
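
To make that arithmetic concrete, here is a minimal Python sketch.  It 
assumes the KEY=VALUE format described in the comments of the 
Provisioner.config quoted further down (later keys override earlier ones), 
and it assumes MaxNumExecutors acts as an upper bound on running workers.  
The function names parse_config and expected_concurrent_jobs are 
hypothetical and are not part of Falkon.

```python
def parse_config(text):
    """Parse KEY=VALUE lines; '#' lines are comments, later keys win."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def expected_concurrent_jobs(allocated_hosts, settings):
    """One job per worker: hosts * workers per host.

    Capping at MaxNumExecutors is an assumption about how the limit applies.
    """
    per_host = int(settings["ExecutorsPerHost"])
    cap = int(settings["MaxNumExecutors"])
    return min(allocated_hosts * per_host, cap)


# Values copied from the "resources numbers" section of the quoted config.
sample = """
#resources numbers
MinNumExecutors=0
MaxNumExecutors=250
ExecutorsPerHost=2
"""

settings = parse_config(sample)
print(expected_concurrent_jobs(50, settings))  # 50 hosts * 2 workers = 100
```

If the number of running jobs stays well below this figure while jobs are 
still queued, the submission rate is the more likely bottleneck.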
>
> But I also get in this run 8 pairs of email messages from PBS, where 
> the first says something like this:
>
> PBS Job Id: 1512406.tg-master.uc.teragrid.org
> Job Name:   STDIN
> Aborted by PBS Server
> Job cannot be executed
> See job standard error file
>
> and the second of each pair says:
>
> PBS Job Id: 1512406.tg-master.uc.teragrid.org
> Job Name:   STDIN
> An error has occurred processing your job, see below.
> Post job file processing error; job 1512406.tg-master.uc.teragrid.org 
> on host tg-v082/0+tg-v076/0+tg-v053/0+tg-v040/0+tg-v034/0Unknown 
> resource type  REJHOST=tg-v082.uc.teragrid.org MSG=invalid home 
> directory '/home/wilde' specified, errno=2 (No such file or directory)
>
This sounds like the nodes did not have /home/wilde NFS-mounted.  There 
is not much you can do, except perhaps moving the Falkon install to GPFS 
to avoid NFS.  I have not seen this error in the past, so it's not 
common.
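
To see whether the failures cluster on a few nodes, the REJHOST field of 
each PBS message can be tallied.  The following is a rough Python sketch, 
assuming the message bodies have been saved as plain strings with the 
quoting prefixes removed and that they follow the layout of the message 
above.  The names parse_rejection and tally_rejecting_hosts are 
hypothetical.

```python
import re
from collections import Counter

# Matches the "REJHOST=... MSG=..., errno=N" portion of the PBS message.
REJECT_RE = re.compile(r"REJHOST=(\S+)\s+MSG=(.*?),\s*errno=(\d+)")


def parse_rejection(body):
    """Return (host, message, errno) from one PBS email body, or None."""
    flat = " ".join(body.split())  # undo the mail client's line wrapping
    match = REJECT_RE.search(flat)
    if match is None:
        return None
    host, message, errno = match.groups()
    return host, message, int(errno)


def tally_rejecting_hosts(bodies):
    """Count how many messages each rejecting host produced."""
    counts = Counter()
    for body in bodies:
        parsed = parse_rejection(body)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts


# Body text copied from the PBS message quoted above.
sample = """Post job file processing error; job 1512406.tg-master.uc.teragrid.org
on host tg-v082/0+tg-v076/0+tg-v053/0+tg-v040/0+tg-v034/0Unknown
resource type  REJHOST=tg-v082.uc.teragrid.org MSG=invalid home
directory '/home/wilde' specified, errno=2 (No such file or directory)"""

print(parse_rejection(sample))
print(tally_rejecting_hosts([sample]))
```

A host that shows up repeatedly in the tally would point to a per-node 
mount problem rather than a site-wide one.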
> On the surface, it looks like some kind of mount or automount failure 
> on various worker nodes - they can't see my TG home directory?
>
> Is that likely?
What else could it be?  At least, that seems to be the problem judging 
from the errors above.
>
> Do you see such messages as well? 
I haven't seen this message in the past, even when starting many workers 
all at the same time.  You could try moving the Falkon install to 
/disks/scratchgpfs1/wilde/ and see if the problems reappear.  You won't 
have to change anything; things should work after simply moving and 
restarting everything!
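
Before moving the install, a quick check run on a suspect node can show 
which of the two locations is actually visible there.  This is only a 
sketch: the helper first_visible_dir is hypothetical, and the two paths 
are the ones mentioned in this thread.

```python
import os


def first_visible_dir(candidates):
    """Return the first candidate path that exists as a directory, else None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


# Paths taken from this thread: the NFS home and the suggested GPFS location.
candidates = ["/home/wilde", "/disks/scratchgpfs1/wilde"]
print(first_visible_dir(candidates))
```

If this prints the GPFS path or None on a node that rejected a job, the 
NFS home was not visible there at the time of the check.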

I hope this helps.

Ioan
> Ben, do you?
>
> My Provisioner.config file has:
>
> #Provisioner config file
> #KEY=VALUE
> #if multiple lines have the same key, the previous value will be 
> overwritten with the new value
> #all paths are relative
>
> #resources numbers
> MinNumExecutors=0
> MaxNumExecutors=250
> ExecutorsPerHost=2
>
> #resources times
> MinResourceAllocationTime_min=60
> MaxResourceAllocationTime_min=60
>
> #resources types
> HostType=any
> #HostType=ia32_compute
> #HostType=ia64_compute
>
> #allocation strategies
> #AllocationStrategy=one_at_a_time
> #AllocationStrategy=additive
> #AllocationStrategy=exponential
> AllocationStrategy=additive
> MinNumHostsPerAllocation=10
> MaxNumHostsPerAllocation=100
>
> #de-allocation strategies, 0 means never de-allocate due to idle time
> DeAllocationIdleTime_sec=300000
> # ^^^^ in msec 300,000 = 300 secs = 5 min
>
> #Falkon information
> FalkonServiceURI=http://tg-viz-login1.uc.teragrid.org:50011/wsrf/services/GenericPortal/core/WS/GPFactoryService
> #FalkonServiceURI=http://viper.uchicago.edu:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService
> EPR_FileName=WorkerEPR.txt
> FalkonStatePollTime_sec=15
>
> #GRAM4 details
> GRAM4_Location=tg-grid1.uc.teragrid.org
> GRAM4_FactoryType=PBS
> #GRAM4_FactoryType=FORK
> #GRAM4_FactoryType=LSF
> #GRAM4_FactoryType=CONDOR
>
> #project accounting information
> Project=TG-STA040017N
> #Project=default
>
> #Executor script
> ExecutorScript=run.worker.sh
>
> #Security Descriptor File
> SecurityFile=etc/client-security-config.xml
>
> #logging
> DRP_Log=logs/drp-status.txt
>
> #enable debug statements
> #DEBUG=true
> DEBUG=false
> DIPERF=false
> #DIPERF=true
>
>
>
>
>
> -------- Original Message --------
> Subject: PBS JOB 1512406.tg-master.uc.teragrid.org
> Date: Wed,  5 Sep 2007 14:46:17 -0500 (CDT)
> From: adm at tg-master.uc.teragrid.org (root)
> To: wilde at tg-grid1.uc.teragrid.org
>
> PBS Job Id: 1512406.tg-master.uc.teragrid.org
> Job Name:   STDIN
> An error has occurred processing your job, see below.
> Post job file processing error; job 1512406.tg-master.uc.teragrid.org 
> on host tg-v082/0+tg-v076/0+tg-v053/0+tg-v040/0+tg-v034/0Unknown 
> resource type  REJHOST=tg-v082.uc.teragrid.org MSG=invalid home 
> directory '/home/wilde' specified, errno=2 (No such file or directory)
>
>

-- 
============================================
Ioan Raicu
Ph.D. Student
============================================
Distributed Systems Laboratory
Computer Science Department
University of Chicago
1100 E. 58th Street, Ryerson Hall
Chicago, IL 60637
============================================
Email: iraicu at cs.uchicago.edu
Web:   http://www.cs.uchicago.edu/~iraicu
       http://dsl.cs.uchicago.edu/
============================================
============================================



