[Swift-devel] Getting PBS email for Falkon failures - no $HOME?
Michael Wilde
wilde at mcs.anl.gov
Wed Sep 5 15:05:36 CDT 2007
Ioan, I'm doing a run of 1000 angle jobs, with worker throttling
parameters as below. I'm seeing about 20 Falkon jobs running, with about
50+ nodes total allocated among them (in various numbers from about 1 to 5).
But I also get in this run 8 pairs of email messages from PBS, where the
first says something like this:
PBS Job Id: 1512406.tg-master.uc.teragrid.org
Job Name: STDIN
Aborted by PBS Server
Job cannot be executed
See job standard error file
and the second of each pair says:
PBS Job Id: 1512406.tg-master.uc.teragrid.org
Job Name: STDIN
An error has occurred processing your job, see below.
Post job file processing error; job 1512406.tg-master.uc.teragrid.org on
host tg-v082/0+tg-v076/0+tg-v053/0+tg-v040/0+tg-v034/0Unknown resource
type REJHOST=tg-v082.uc.teragrid.org MSG=invalid home directory
'/home/wilde' specified, errno=2 (No such file or directory)
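To see whether the rejections cluster on a few nodes, the second message of each pair could be parsed for the rejecting host. The following is only a rough sketch: it assumes the mail bodies are available as plain strings and that they all follow the REJHOST=... MSG=..., errno=... layout shown above. The function names parse_pbs_mail and tally_rejecting_hosts are made up for illustration and are not part of PBS or Falkon.

```python
import re
from collections import Counter

# Patterns are based only on the message layout quoted above (an assumption).
JOB_ID_RE = re.compile(r"PBS Job Id:\s*(\S+)")
REJECT_RE = re.compile(r"REJHOST=(\S+)\s+MSG=(.*?),\s*errno=(\d+)")


def parse_pbs_mail(text):
    """Return job id, rejecting host, message and errno, or None if absent."""
    # The mail wraps the error across lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    reject = REJECT_RE.search(flat)
    if reject is None:
        return None
    job = JOB_ID_RE.search(flat)
    return {
        "job_id": job.group(1) if job else None,
        "rejhost": reject.group(1),
        "message": reject.group(2),
        "errno": int(reject.group(3)),
    }


def tally_rejecting_hosts(mails):
    """Count how often each host appears as REJHOST across a set of mails."""
    parsed = (parse_pbs_mail(m) for m in mails)
    return Counter(p["rejhost"] for p in parsed if p is not None)
```

Mails without a REJHOST field (such as the first "Aborted by PBS Server" message of each pair) are simply skipped by the tally.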
On the surface, it looks like some kind of mount or automount failure on
various worker nodes - they can't see my TG home directory?
Is that likely?
Do you see such messages as well? Ben, do you?
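One way to test the mount theory would be to run a tiny probe on the suspect nodes and see whether the home directory is visible there. This is a minimal sketch, not something Falkon or PBS provides: check_home is a hypothetical helper, and how the script gets launched on a given worker node (for example from inside the worker script) is left open.

```python
import os
import socket
import sys


def check_home(path):
    """Stat the directory and report this host's view of it."""
    result = {"host": socket.gethostname(), "path": path}
    try:
        # Touching the path should also trigger an automount, if one is set up.
        os.stat(path)
    except OSError as exc:
        result["errno"] = exc.errno
        result["detail"] = os.strerror(exc.errno)
    else:
        result["errno"] = 0
        result["detail"] = "ok"
    return result


if __name__ == "__main__":
    # Default to the invoking user's home directory when no path is given.
    target = sys.argv[1] if len(sys.argv) > 1 else os.path.expanduser("~")
    print(check_home(target))
```

A node with the same problem as in the PBS mail would be expected to report the errno for "No such file or directory" for the home path.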
My Provisioner.config file has:
#Provisioner config file
#KEY=VALUE
#if multiple lines have the same key, the previous value will be overwritten with the new value
#all paths are relative
#resources numbers
MinNumExecutors=0
MaxNumExecutors=250
ExecutorsPerHost=2
#resources times
MinResourceAllocationTime_min=60
MaxResourceAllocationTime_min=60
#resources types
HostType=any
#HostType=ia32_compute
#HostType=ia64_compute
#allocation strategies
#AllocationStrategy=one_at_a_time
#AllocationStrategy=additive
#AllocationStrategy=exponential
AllocationStrategy=additive
MinNumHostsPerAllocation=10
MaxNumHostsPerAllocation=100
#de-allocation strategies, 0 means never de-allocate due to idle time
DeAllocationIdleTime_sec=300000
# ^^^^ in msec 300,000 = 300 secs = 5 min
#Falkon information
FalkonServiceURI=http://tg-viz-login1.uc.teragrid.org:50011/wsrf/services/GenericPortal/core/WS/GPFactoryService
#FalkonServiceURI=http://viper.uchicago.edu:50001/wsrf/services/GenericPortal/core/WS/GPFactoryService
EPR_FileName=WorkerEPR.txt
FalkonStatePollTime_sec=15
#GRAM4 details
GRAM4_Location=tg-grid1.uc.teragrid.org
GRAM4_FactoryType=PBS
#GRAM4_FactoryType=FORK
#GRAM4_FactoryType=LSF
#GRAM4_FactoryType=CONDOR
#project accounting information
Project=TG-STA040017N
#Project=default
#Executor script
ExecutorScript=run.worker.sh
#Security Descriptor File
SecurityFile=etc/client-security-config.xml
#logging
DRP_Log=logs/drp-status.txt
#enable debug statements
#DEBUG=true
DEBUG=false
DIPERF=false
#DIPERF=true
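For reference, the effective settings in a file like this can be worked out with a few lines of Python. This is a sketch of the KEY=VALUE, last-value-wins behavior that the file's own header comment describes; it is not the parser Falkon actually uses, and load_config and check_ranges are hypothetical names. The Min/Max key pairs are the ones that appear in the file above.

```python
# Min/Max key pairs taken from the config file above.
RANGE_PAIRS = [
    ("MinNumExecutors", "MaxNumExecutors"),
    ("MinResourceAllocationTime_min", "MaxResourceAllocationTime_min"),
    ("MinNumHostsPerAllocation", "MaxNumHostsPerAllocation"),
]


def load_config(text):
    """Parse KEY=VALUE lines; a later line with the same key overwrites."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, '#' comments, and stray lines without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_ranges(settings):
    """Return a message for every Min/Max pair where Min exceeds Max."""
    problems = []
    for low_key, high_key in RANGE_PAIRS:
        if low_key in settings and high_key in settings:
            if int(settings[low_key]) > int(settings[high_key]):
                problems.append(f"{low_key} > {high_key}")
    return problems
```

With the values above, check_ranges should come back empty, since every Min is at or below its Max.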
-------- Original Message --------
Subject: PBS JOB 1512406.tg-master.uc.teragrid.org
Date: Wed, 5 Sep 2007 14:46:17 -0500 (CDT)
From: adm at tg-master.uc.teragrid.org (root)
To: wilde at tg-grid1.uc.teragrid.org
PBS Job Id: 1512406.tg-master.uc.teragrid.org
Job Name: STDIN
An error has occurred processing your job, see below.
Post job file processing error; job 1512406.tg-master.uc.teragrid.org on
host tg-v082/0+tg-v076/0+tg-v053/0+tg-v040/0+tg-v034/0Unknown resource
type REJHOST=tg-v082.uc.teragrid.org MSG=invalid home directory
'/home/wilde' specified, errno=2 (No such file or directory)