[Swift-user] hung submission
Altaweel, Mark
m.altaweel at ucl.ac.uk
Sun May 3 16:41:59 CDT 2015
Hi again,
I get this on the local:
/dev/sda1 on / type ext3 (rw,noatime,nodiratime)
none on /proc type proc (rw,nosuid)
none on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda2 on /var type ext3 (rw,noatime,nodiratime)
none on /dev/shm type tmpfs (rw)
none on /tmp type tmpfs (rw,nodev,noatime,nodiratime,size=32g)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
nfs:/exports/cmshared on /cm/shared type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,addr=10.143.0.14)
nfs:/exports/home on /home type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,acdirmin=0,acdirmax=0,acregmax=10,noac,addr=10.143.0.14)
nfs:/exports/home0 on /imports/home0 type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,acdirmin=0,acdirmax=0,acregmax=10,noac,addr=10.143.0.14)
nfs:/exports/home1 on /imports/home1 type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,acdirmin=0,acdirmax=0,acregmax=10,noac,addr=10.143.0.14)
nfs:/exports/homeL on /imports/homeL type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,acdirmin=0,acdirmax=0,acregmax=10,noac,addr=10.143.0.14)
nfs:/exports/software on /shared type nfs (ro,rsize=32768,wsize=32768,hard,intr,addr=10.143.0.14)
nfs:/exports/sge on /cm/shared/apps/sge type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,addr=10.143.0.14)
nfs:/exports/lcgsoft on /imports/lcgsoft type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,addr=10.143.0.14)
nfs:/exports/deptapps on /imports/deptapps type nfs (rw,noatime,rsize=32768,wsize=32768,hard,intr,acdirmin=0,acdirmax=0,acregmax=10,noac,addr=10.143.0.14)
nfsd on /proc/fs/nfsd type nfsd (rw)
none on /dev/cpuset type cpuset (rw)
10.143.0.127@tcp:10.143.0.126@tcp:/scratch on /scratch type lustre (rw,_netdev,noatime,nodiratime,flock)
I don’t really see the output on the compute node.
Mark
On May 3, 2015, at 9:30 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
Hi,
That's actually good since we eliminated lots of moving parts.
~/Scratch seems to be the right spot according to
https://wiki.rc.ucl.ac.uk/wiki/Managing_Data_on_Legion
What I suspect might be happening is that the mountpoints are different
between login nodes and compute nodes.
Can you try running these on both the login node and a compute node:
mount (or df)
ls -al $HOME/Scratch
and then pasting the outputs back in an email.
Mihael
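[Editorial aside: the login/compute comparison Mihael suggests can be captured to files and diffed. A minimal sketch follows, with inline sample data standing in for the real listings; on the actual cluster you would capture them with `mount | sort` on the login node and via an interactive SGE session (e.g. `qrsh 'mount | sort'`) or a one-line batch job on a compute node. The `/tmp/mounts.*` paths are illustrative.]

```shell
# Sample listings stand in for real output; in practice:
#   mount | sort > /tmp/mounts.login             (run on the login node)
#   qrsh 'mount | sort' > /tmp/mounts.compute    (run on a compute node via SGE)
printf '%s\n' \
  '/dev/sda1 on / type ext3' \
  'nfs:/exports/home1 on /imports/home1 type nfs' | sort > /tmp/mounts.login
printf '%s\n' \
  '/dev/sda1 on / type ext3' | sort > /tmp/mounts.compute

# Lines prefixed '<' appear only on the login node. If /imports/home1 is among
# them, the compute nodes cannot see the job's output directory at that path.
diff /tmp/mounts.login /tmp/mounts.compute || true
```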
On Sun, 2015-05-03 at 20:18 +0000, Altaweel, Mark wrote:
If I do a qsub on the script I get the same error message:
job_number: 6597054
exec_file: job_scripts/6597054
submission_time: Sun May 3 21:15:23 2015
owner: tcrnma3
uid: 147447
group: users
gid: 1002
sge_o_home: /home/tcrnma3/
sge_o_log_name: tcrnma3
sge_o_path: /shared/ucl/apps/mrxvt/0.5.4/bin:/shared/ucl/apps/nedit/5.6/bin:/shared/ucl/apps/gerun/i:/usr/mpi/qlogic//sbin:/usr/mpi/qlogic//bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/shared/ucl/apps/bin:/cm/shared/apps/intel/toolkit/Compiler/11.1/072//bin:/cm/shared/apps/intel/toolkit/Compiler/11.1/072//bin/intel64:/cm/shared/apps/sge/6.2u3/bin/lx26-amd64:/home/tcrnma3//bin
sge_o_shell: /bin/bash
sge_o_workdir: /imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts
sge_o_host: login06
account: ucl_jsv4h;S=0;T=1.0;W=1.0;X=1.0;Y=1.0;V=0;Z=1.0;U=1.0
stderr_path_list: NONE:NONE:/imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit.stderr
hard resource_list: batch=true,bonus=0,h_rt=540,jcs=0,jct=1,jcu=1,jcv=0,jcw=1,jcx=1,jcy=1,jcz=1,maxversion=2,memory=1M,penalty=604801,s_rt=530
mail_list: tcrnma3 at login06.data.legion.ucl.ac.uk
notify: FALSE
job_name: B0503-3707460-0
stdout_path_list: NONE:NONE:/imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit.stdout
jobshare: 0
restart: n
shell_list: NONE:/bin/ksh
env_list: WORKER_LOGGING_LEVEL=NONE,XAUTHORITY=/scratch/scratch/tcrnma3/.Xauthority,PAID=0,GPU=0,OMP_NUM_THREADS=1,MICCOUNT=0,SCRATCH_SPACE=10737418240,MEMPERSLOT=1048576,SGE_SHARENODE=1,IFS=
script_file: SGE7948718974736431209.submit
project: AllUsers
error reason 1: 05/03/2015 21:15:57 [147447:18805]: error: can't open output file "/imports/home1/tcrnma3/Scratch/Ur
scheduling info: (Collecting of scheduler job information is turned off)
Mark
On May 3, 2015, at 9:06 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
It seems that it is more likely that the error message gets truncated
rather than the path itself. After all, stdout_path_list does contain
what seems to be the correct path.
There should be a
script: /imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit
(or similar) that should be available while a swift run is in progress.
I think one way to troubleshoot things would be to copy that script and
submit it manually.
Mihael
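[Editorial aside: Mihael's manual-submission check might look like the sketch below. The script path is the one reported in the `qstat -j` output earlier in the thread; the generated script only exists while a Swift run is in progress, so the copy is guarded.]

```shell
# Hypothetical path, taken from the qstat output above; the generated
# submit script only exists while the Swift run is in progress.
submit_script="$HOME/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit"

if [ -f "$submit_script" ]; then
    cp "$submit_script" /tmp/manual.submit
    qsub /tmp/manual.submit        # then inspect with: qstat -j <job_id>
else
    echo "not found (is a Swift run active?): $submit_script"
fi
```

Submitting the copy by hand takes Swift out of the loop entirely, so any "can't open output file" error seen afterwards points at the scheduler/filesystem rather than at Swift.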
On Sun, 2015-05-03 at 19:20 +0000, Altaweel, Mark wrote:
Yes, so I do import Swift in the shell script that gets distributed. However, it seems to end the same way. I don’t understand why it truncates the path, unless the path is there and the log only writes a certain number of the characters.
This is added to the script:
export PATH=$PATH:~/Scratch/swift-0.96-sge-mod/bin
module load java/1.7.0_45
So Java is included. The same thing happens if I remove it, though.
Mark
On May 3, 2015, at 8:07 PM, Mihael Hategan <hategan at mcs.anl.gov> wrote:
On Sun, 2015-05-03 at 18:43 +0000, Altaweel, Mark wrote:
error reason 1: 05/03/2015 19:38:15 [147447:22761]: error: can't open output file "/imports/home1/tcrnma3/Scratch/Ur
... aaand my PE suggestion had little to do with the problem.
Is /imports mounted on compute nodes?
Mihael
_______________________________________________
Swift-user mailing list
Swift-user at ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user