[Swift-devel] Getting PBS email for Falkon failures - no $HOME?

Michael Wilde wilde at mcs.anl.gov
Wed Sep 5 15:05:36 CDT 2007


Ioan, I'm doing a run of 1000 angle jobs, with worker throttling 
parameters as below.  I'm seeing about 20 Falkon jobs running, with 
50+ nodes total allocated among them (in varying numbers, from about 1 to 5 each).

But in this run I also get 8 pairs of email messages from PBS, where the 
first of each pair says something like this:

PBS Job Id: 1512406.tg-master.uc.teragrid.org
Job Name:   STDIN
Aborted by PBS Server
Job cannot be executed
See job standard error file

and the second of each pair says:

PBS Job Id: 1512406.tg-master.uc.teragrid.org
Job Name:   STDIN
An error has occurred processing your job, see below.
Post job file processing error; job 1512406.tg-master.uc.teragrid.org on 
host tg-v082/0+tg-v076/0+tg-v053/0+tg-v040/0+tg-v034/0Unknown resource 
type  REJHOST=tg-v082.uc.teragrid.org MSG=invalid home directory 
'/home/wilde' specified, errno=2 (No such file or directory)

On the surface, it looks like some kind of mount or automount failure on 
various worker nodes - they can't see my TG home directory?

Is that likely?

Do you see such messages as well? Ben, do you?
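
One way to check whether these failures cluster on particular nodes is to tally the REJHOST field across all the PBS emails. Below is a minimal Python sketch, assuming the message bodies have been saved as plain strings; the function name tally_rejecting_hosts is made up for illustration and the field layout is taken only from the one message quoted above, so other PBS versions may format it differently.

```python
import re
from collections import Counter

# Matches the REJHOST=<hostname> field as it appears in the quoted PBS message.
REJHOST_RE = re.compile(r"REJHOST=(\S+)")

def tally_rejecting_hosts(messages):
    """Count how many PBS error messages name each rejecting host."""
    counts = Counter()
    for body in messages:
        # The mail text is wrapped across lines, so collapse whitespace first.
        flat = " ".join(body.split())
        match = REJHOST_RE.search(flat)
        if match:
            counts[match.group(1)] += 1
    return counts

# The message body quoted in this email, used as sample input.
sample = """Post job file processing error; job 1512406.tg-master.uc.teragrid.org on
host tg-v082/0+tg-v076/0+tg-v053/0+tg-v040/0+tg-v034/0Unknown resource
type  REJHOST=tg-v082.uc.teragrid.org MSG=invalid home directory
'/home/wilde' specified, errno=2 (No such file or directory)"""

print(tally_rejecting_hosts([sample]))
```

If the same few hosts account for most of the rejections, that would point toward a per-node mount problem rather than something in the job submission itself.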

My Provisioner.config file has:

#Provisioner config file
#KEY=VALUE
#if multiple lines have the same key, the previous value will be overwritten with the new value
#all paths are relative

#resources numbers
MinNumExecutors=0
MaxNumExecutors=250
ExecutorsPerHost=2

#resources times
MinResourceAllocationTime_min=60
MaxResourceAllocationTime_min=60

#resources types
HostType=any
#HostType=ia32_compute
#HostType=ia64_compute

#allocation strategies
#AllocationStrategy=one_at_a_time
#AllocationStrategy=additive
#AllocationStrategy=exponential
AllocationStrategy=additive
MinNumHostsPerAllocation=10
MaxNumHostsPerAllocation=100

#de-allocation strategies, 0 means never de-allocate due to idle time
DeAllocationIdleTime_sec=300000
# ^^^^ in msec 300,000 = 300 secs = 5 min

#Falkon information
FalkonServiceURI=http://tg-viz-login1.uc.teragrid.org:50011/wsrf/services/GenericPortal/core/WS/GPFactoryService
#FalkonServiceURI=http://viper.uchicago.edu:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService
EPR_FileName=WorkerEPR.txt
FalkonStatePollTime_sec=15

#GRAM4 details
GRAM4_Location=tg-grid1.uc.teragrid.org
GRAM4_FactoryType=PBS
#GRAM4_FactoryType=FORK
#GRAM4_FactoryType=LSF
#GRAM4_FactoryType=CONDOR

#project accounting information
Project=TG-STA040017N
#Project=default

#Executor script
ExecutorScript=run.worker.sh

#Security Descriptor File
SecurityFile=etc/client-security-config.xml

#logging
DRP_Log=logs/drp-status.txt

#enable debug statements
#DEBUG=true
DEBUG=false
DIPERF=false
#DIPERF=true
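
For anyone comparing settings, the KEY=VALUE rules stated in the file's header comments (lines starting with # are comments, and a repeated key overwrites the earlier value) can be sketched in Python as below. This is only an illustration of those rules, not Falkon's actual parser; the function name parse_provisioner_config and the duplicated AllocationStrategy lines in the sample are assumptions made for the example.

```python
def parse_provisioner_config(text):
    """Parse KEY=VALUE lines; '#' lines are comments; a later key overwrites an earlier one."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=" not in line:
            continue  # skip fragments that are not KEY=VALUE (e.g. wrapped text)
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

# Illustrative input: the duplicate key is added here to show the overwrite rule.
sample = """\
#allocation strategies
#AllocationStrategy=exponential
AllocationStrategy=one_at_a_time
AllocationStrategy=additive
DeAllocationIdleTime_sec=300000
"""

cfg = parse_provisioner_config(sample)
print(cfg["AllocationStrategy"])

# Per the comment in the file, the idle time value is in milliseconds.
idle_minutes = int(cfg["DeAllocationIdleTime_sec"]) / 1000 / 60
print(idle_minutes)
```

Note that the commented-out strategy is ignored and the last uncommented AllocationStrategy line is the one that takes effect.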





-------- Original Message --------
Subject: PBS JOB 1512406.tg-master.uc.teragrid.org
Date: Wed,  5 Sep 2007 14:46:17 -0500 (CDT)
From: adm at tg-master.uc.teragrid.org (root)
To: wilde at tg-grid1.uc.teragrid.org

PBS Job Id: 1512406.tg-master.uc.teragrid.org
Job Name:   STDIN
An error has occurred processing your job, see below.
Post job file processing error; job 1512406.tg-master.uc.teragrid.org on 
host tg-v082/0+tg-v076/0+tg-v053/0+tg-v040/0+tg-v034/0Unknown resource 
type  REJHOST=tg-v082.uc.teragrid.org MSG=invalid home directory 
'/home/wilde' specified, errno=2 (No such file or directory)




