[Swift-user] hung submission

Altaweel, Mark m.altaweel at ucl.ac.uk
Sun May 3 13:43:31 CDT 2015


Thanks Yadu.

Yes, I did check, and on digging in, the job seems to fail:

6596817 2.69388 B0503-3707 tcrnma3      Eqw   05/03/2015 19:37:48

If I then look at the reason (qstat -j), I get the output below; the error reason shows a truncated version of the file path from my submission.

It seems odd that it shortens the path, or at least reports it that way.

Mark


job_number:                 6596817
exec_file:                  job_scripts/6596817
submission_time:            Sun May  3 19:37:48 2015
owner:                      tcrnma3
uid:                        147447
group:                      users
gid:                        1002
sge_o_home:                 /home/tcrnma3/
sge_o_log_name:             tcrnma3
sge_o_path:                 /shared/ucl/apps/Java/64/jdk1.7.0_45/bin:/shared/ucl/apps/mrxvt/0.5.4/bin:/shared/ucl/apps/nedit/5.6/bin:/shared/ucl/apps/gerun/i:/usr/mpi/qlogic//sbin:/usr/mpi/qlogic//bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/shared/ucl/apps/bin:/cm/shared/apps/intel/toolkit/Compiler/11.1/072//bin:/cm/shared/apps/intel/toolkit/Compiler/11.1/072//bin/intel64:/cm/shared/apps/sge/6.2u3/bin/lx26-amd64:/home/tcrnma3//bin:/home/tcrnma3//Scratch/swift-0.96-sge-mod/bin:/sbin
sge_o_shell:                /bin/bash
sge_o_workdir:              /imports/home1/tcrnma3/Scratch/UrbanModel
sge_o_host:                 login08
account:                    ucl_jsv4h;S=0;T=1.0;W=1.0;X=1.0;Y=1.0;V=0;Z=1.0;U=1.0
stderr_path_list:           NONE:NONE:/imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit.stderr
hard resource_list:         batch=true,bonus=0,h_rt=540,jcs=0,jct=1,jcu=1,jcv=0,jcw=1,jcx=1,jcy=1,jcz=1,maxversion=2,memory=1M,penalty=604801,s_rt=530
mail_list:                  tcrnma3 at login08.data.legion.ucl.ac.uk
notify:                     FALSE
job_name:                   B0503-3707460-0
stdout_path_list:           NONE:NONE:/imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit.stdout
jobshare:                   0
restart:                    n
shell_list:                 NONE:/bin/ksh
env_list:                   WORKER_LOGGING_LEVEL=NONE,XAUTHORITY=/scratch/scratch/tcrnma3/.Xauthority,PAID=0,GPU=0,OMP_NUM_THREADS=1,MICCOUNT=0,SCRATCH_SPACE=10737418240,MEMPERSLOT=1048576,SGE_SHARENODE=1,IFS=
script_file:                /imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit
project:                    AllUsers
error reason    1:          05/03/2015 19:38:15 [147447:22761]: error: can't open output file "/imports/home1/tcrnma3/Scratch/Ur
scheduling info:            (Collecting of scheduler job information is turned off)
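
To make the fields above easier to inspect, here is a minimal Python sketch that splits `qstat -j` output into key/value pairs and pulls out the stdout/stderr paths. The function names (`parse_qstat_j`, `output_path`) are hypothetical helpers, not part of SGE or Swift, and the directory check at the end is only an assumption about one possible cause of a "can't open output file" error. It tests the machine the script runs on, which may not see the same filesystem as the compute node.

```python
import os

# A few lines copied from the qstat -j output above (truncated path left as reported).
SAMPLE = """\
job_number:                 6596817
submission_time:            Sun May  3 19:37:48 2015
stderr_path_list:           NONE:NONE:/imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit.stderr
stdout_path_list:           NONE:NONE:/imports/home1/tcrnma3/Scratch/UrbanModel/run005/scripts/SGE7948718974736431209.submit.stdout
error reason    1:          05/03/2015 19:38:15 [147447:22761]: error: can't open output file "/imports/home1/tcrnma3/Scratch/Ur
"""


def parse_qstat_j(text):
    """Split each 'key: value' line at the first colon; collapse spaces in keys."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[" ".join(key.split())] = value.strip()
    return info


def output_path(path_list):
    """Drop the two leading colon-separated fields (NONE:NONE: here) and keep the path."""
    return path_list.split(":", 2)[-1]


if __name__ == "__main__":
    info = parse_qstat_j(SAMPLE)
    print("error:", info.get("error reason 1"))
    for key in ("stdout_path_list", "stderr_path_list"):
        parent = os.path.dirname(output_path(info[key]))
        status = "exists" if os.path.isdir(parent) else "not visible from this host"
        print(key, "->", parent, "(" + status + ")")
```

Running the same check from a shell on a compute node, if that is possible on the cluster, would be more telling than running it on the login node.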

On May 3, 2015, at 2:32 PM, Yadu Nand Babuji <yadunand at uchicago.edu> wrote:

Hi Mark,

What you are seeing is progress reports from Swift at 30-second intervals; all this
indicates is that your jobs were submitted to the queue for execution. Until the local resource
manager (in this case, the SGE scheduler) starts executing the jobs, Swift has to wait.
From your description, all I can gather is that you are seeing long wait times, with no indication
of any failure.

Could you check whether you can spot the jobs submitted by Swift in the queue? To do this, open
a separate terminal on the login node while your Swift run is waiting in the submitted state,
and use qstat to see your jobs.

[coursa1 at login06 part05]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
6593408 0.00000 B0503-2802 coursa1      qw    05/03/2015 14:28:40                                    1
6593409 0.00000 B0503-2802 coursa1      qw    05/03/2015 14:28:41                                    1

The qw state indicates that your jobs are waiting in the queue.
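
If it helps to check the states programmatically, the following is a rough Python sketch that reads plain `qstat` output and labels each job. The helper names (`parse_qstat`, `describe_state`) are made up for illustration, and the state mapping is deliberately simplified: it assumes that an `E` in the state column marks an error (as in `Eqw`), `qw` means waiting, and `r` means running. The second sample row is the `Eqw` job reported later in this thread.

```python
# Sample rows taken from the qstat listings quoted in this thread.
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
6593408 0.00000 B0503-2802 coursa1      qw    05/03/2015 14:28:40                                    1
6596817 2.69388 B0503-3707 tcrnma3      Eqw   05/03/2015 19:37:48
"""


def parse_qstat(text):
    """Return (job_id, name, user, state) for every job row in plain qstat output."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; header and separator rows do not.
        if len(fields) >= 5 and fields[0].isdigit():
            jobs.append((fields[0], fields[2], fields[3], fields[4]))
    return jobs


def describe_state(state):
    """Simplified reading of the SGE state column (assumption: not exhaustive)."""
    if "E" in state:
        return "error"
    if "qw" in state:
        return "waiting"
    if state == "r":
        return "running"
    return "other"


if __name__ == "__main__":
    for job_id, name, user, state in parse_qstat(SAMPLE):
        print(job_id, name, user, state, "->", describe_state(state))
```

A job reported as "error" will not start on its own, so for those jobs `qstat -j <job-ID>` is the next place to look.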

Thanks,
Yadu


On 05/03/2015 01:11 AM, Altaweel, Mark wrote:
Hi,

I tried executing Swift on our institution's SGE-based cluster, and the submission seems to hang or not execute properly. It prints the following output:

Swift 0.96-RC1 git-rev: c7a1dc478a40865f5639f186284697d53978bd48 heads/release-0.96-swift 6274 (modified locally)
RunID: run002
Progress: Sun, 03 May 2015 07:00:29+0100
Number of parameter combinations: 2
Stride: 1
Begin: 1, End: 1
Begin: 2, End: 2
Progress: Sun, 03 May 2015 07:00:30+0100  Submitted:2
Error: No parallel environment specified
Progress: Sun, 03 May 2015 07:01:00+0100  Submitted:2
Progress: Sun, 03 May 2015 07:01:30+0100  Submitted:2
Progress: Sun, 03 May 2015 07:02:00+0100  Submitted:2
Progress: Sun, 03 May 2015 07:02:30+0100  Submitted:2
Progress: Sun, 03 May 2015 07:03:00+0100  Submitted:2
Progress: Sun, 03 May 2015 07:03:30+0100  Submitted:2
Progress: Sun, 03 May 2015 07:04:00+0100  Submitted:2
Progress: Sun, 03 May 2015 07:04:30+0100  Submitted:2

This just repeats and does not seem to stop.
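
As a rough way to quantify "does not seem to stop", the Python sketch below counts how many of the most recent `Progress:` lines carry identical counters. The names `parse_progress` and `unchanged_ticks` are hypothetical, and the parsing assumes counters of the single-word form `Name:number` as in the output above; multi-word counter labels would need a more careful pattern.

```python
import re

# Matches "Progress: <timestamp ending in HH:MM:SS+ZZZZ>  <optional counters>".
PROGRESS = re.compile(
    r"^Progress:\s+(?P<stamp>.+?\d{2}:\d{2}:\d{2}[+-]\d{4})\s*(?P<counts>.*)$"
)

# Lines copied from the Swift console output above.
SAMPLE = [
    "Progress: Sun, 03 May 2015 07:00:29+0100",
    "Number of parameter combinations: 2",
    "Progress: Sun, 03 May 2015 07:00:30+0100  Submitted:2",
    "Error: No parallel environment specified",
    "Progress: Sun, 03 May 2015 07:01:00+0100  Submitted:2",
    "Progress: Sun, 03 May 2015 07:01:30+0100  Submitted:2",
    "Progress: Sun, 03 May 2015 07:02:00+0100  Submitted:2",
]


def parse_progress(line):
    """Return (timestamp, {counter: value}) for a Progress line, else None."""
    match = PROGRESS.match(line)
    if not match:
        return None
    counts = {k: int(v) for k, v in re.findall(r"(\w+):(\d+)", match.group("counts"))}
    return match.group("stamp"), counts


def unchanged_ticks(lines):
    """Count how many of the latest progress reports have the same counters."""
    reports = [parsed[1] for parsed in map(parse_progress, lines) if parsed]
    if not reports:
        return 0
    last = reports[-1]
    ticks = 0
    for counts in reversed(reports):
        if counts != last:
            break
        ticks += 1
    return ticks


if __name__ == "__main__":
    print("identical progress reports at the tail:", unchanged_ticks(SAMPLE))
```

A large count only shows that nothing has changed state; it cannot tell a long queue wait apart from a job that has failed in the scheduler.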

The log file has the following messages, which also repeat:

2015-05-03 07:08:22,401+0100 INFO  RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64559392, JVMThreads: 52
2015-05-03 07:08:23,401+0100 INFO  RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64559432, JVMThreads: 52
2015-05-03 07:08:23,709+0100 INFO  AbstractQueuePoller Actively monitored: 1, New: 0, Done: 0
2015-05-03 07:08:24,401+0100 INFO  RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584080, JVMThreads: 52
2015-05-03 07:08:25,401+0100 INFO  RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584120, JVMThreads: 52
2015-05-03 07:08:26,401+0100 INFO  RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584160, JVMThreads: 52
2015-05-03 07:08:27,401+0100 INFO  RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584200, JVMThreads: 52
2015-05-03 07:08:28,401+0100 INFO  RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584240, JVMThreads: 52
2015-05-03 07:08:29,401+0100 INFO  RuntimeStats$ProgressTicker HeapMax: 954728448, CrtHeap: 378535936, UsedHeap: 64584280, JVMThreads: 52


I did run this locally to see whether anything was wrong with the submission, and it worked fine with proper output.

Thank you.

Mark





_______________________________________________
Swift-user mailing list
Swift-user at ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
