[Swift-devel] Error 521 provider-staging files to PADS nodes

Michael Wilde wilde at mcs.anl.gov
Wed Jan 19 13:37:15 CST 2011


An interesting observation on the returned output files:  there are exactly 33 files in the output dir from this run: the same as the number of jobs Swift reports as Finished successfully.   But of those 33, the last 4 are only of partial length, and one of the 4 is length zero (see below).

Its surprising and perhaps a bug that the jobs are reported finished before the output file is fully written???

Also this 3-partial plus 1-zero file looks to me like one worker staging op hung (the oldest of the 4 incomplete output files) and then perhaps 3 were cut short when the coaster service data protocol froze?

- Mike

login1$ pwd
/scratch/local/wilde/lab
login1$ cd outdir
login1$ ls -lt | grep 10:48      
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0023.out
-rw-r--r-- 1 wilde ci-users  2686976 Jan 19 10:48 f.0125.out
-rw-r--r-- 1 wilde ci-users  2621440 Jan 19 10:48 f.0167.out
-rw-r--r-- 1 wilde ci-users        0 Jan 19 10:48 f.0259.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0336.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0380.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0015.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0204.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0379.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0066.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0221.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0281.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0403.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0142.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0187.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0067.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0081.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0134.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0136.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0146.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0254.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0362.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0312.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0370.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0389.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0027.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0094.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0183.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0363.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0016.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0025.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0429.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0239.out
login1$ ls -lt | grep 10:48 | wc -l
33
login1$ 




----- Original Message -----
> Mihael,
> 
> The following test on pads failed/hung with an error 521 from
> worker.pl:
> 
> ---
> sub getFileCBDataInIndirect {
> ...
> elsif ($timeout) {
> queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", "Timeout
> staging in file"));
> delete($JOBDATA{$jobid});
> ---
> 
> single foreach loop, doing 1,000 "mv" commands
> 
> throttle was 200 jobs to this coaster pool (1 4-node 32-core PBS job):
> 
> <pool handle="localhost" sysinfo="INTEL32::LINUX">
> <execution provider="coaster" url="" jobmanager="local:pbs"/>
> <profile namespace="globus" key="workersPerNode">8</profile>
> <profile namespace="globus" key="maxTime">3500</profile>
> <profile namespace="globus" key="slots">1</profile>
> <profile namespace="globus" key="nodeGranularity">4</profile>
> <profile namespace="globus" key="maxNodes">4</profile>
> <profile namespace="globus" key="queue">short</profile>
> <profile namespace="karajan" key="jobThrottle">2.0</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <filesystem provider="local"/>
> <workdirectory>/scratch/local/wilde/test/swiftwork</workdirectory>
> <profile namespace="swift" key="stagingMethod">file</profile>
> <scratch>/scratch/local/wilde/swiftscratch</scratch>
> </pool>
> 
> Ran 33 jobs - 1 job over 1 "wave" of 32 and then one or more workers
> timed out. Note that the hang may have happened earlier, as no new
> jobs were starting as the jobs in the first wave were finishing.
> 
> time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps
> mvn.swift -n=1000 >& out &
> 
> 
> The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net.
> 
> Swift stdout showed the following after waiting a while for a 4-node
> PADS coaster allocation to start:
> 
> Progress: Selecting site:799 Submitted:201
> Progress: Selecting site:799 Submitted:201
> Progress: Selecting site:799 Submitted:200 Active:1
> Progress: Selecting site:798 Submitted:177 Active:24 Finished
> successfully:1
> Progress: Selecting site:796 Submitted:172 Active:28 Finished
> successfully:4
> Progress: Selecting site:792 Submitted:176 Active:24 Finished
> successfully:8
> Progress: Selecting site:788 Submitted:180 Active:20 Finished
> successfully:12
> Progress: Selecting site:784 Submitted:184 Active:16 Finished
> successfully:16
> Progress: Selecting site:780 Submitted:188 Active:12 Finished
> successfully:20
> Progress: Selecting site:777 Submitted:191 Active:9 Finished
> successfully:23
> Progress: Selecting site:773 Submitted:195 Active:5 Finished
> successfully:27
> Progress: Selecting site:770 Submitted:197 Active:3 Finished
> successfully:30
> Progress: Selecting site:767 Submitted:200 Finished successfully:33
> Progress: Selecting site:766 Submitted:201 Finished successfully:33
> Progress: Selecting site:766 Submitted:201 Finished successfully:33
> Progress: Selecting site:766 Submitted:201 Finished successfully:33
> Progress: Selecting site:766 Submitted:201 Finished successfully:33
> Progress: Selecting site:766 Submitted:201 Finished successfully:33
> Progress: Selecting site:766 Submitted:201 Finished successfully:33
> Progress: Selecting site:766 Submitted:200 Active:1 Finished
> successfully:33
> Execution failed:
> Job failed with an exit code of 521
> login1$
> login1$
> login1$ pwd
> /scratch/local/wilde/lab
> login1$ ls -lt | head
> total 51408
> -rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51
> mvn-20110119-0956-s3s8h9h2.log
> 
> (copied to ~wilde)
> 
> script was:
> 
> login1$ cat mvn.swift
> type file;
> 
> app (file o) mv (file i)
> {
> mv @i @o;
> }
> 
> file out[]<simple_mapper; location="outdir",
> prefix="f.",suffix=".out">;
> foreach j in [1:@toint(@arg("n","1"))] {
> file data<"data.txt">;
> out[j] = mv(data);
> }
> 
> 
> data.txt was 3MB
> 
> A look at the outdir gives a clue to where things hung: The files of
> <= ~3MB from time 10:48 are from this job. Files from 10:39 and
> earlier are from other manual runs executed on login1, Note that 3 of
> the 3MB output files have length 0 or <3MB, and were likely in transit
> back from the worker:
> 
> -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out
> -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out
> -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out
> 
> 
> login1$ pwd
> /scratch/local/wilde/lab
> login1$ cd outdir
> login1$ ls -lt | head -40
> total 2772188
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out
> -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out
> -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out
> -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out
> -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out
> -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out
> -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out
> -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out
> -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out
> -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out
> -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out
> l
> 
> - Mike
> 
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory




More information about the Swift-devel mailing list