[Swift-devel] Error 521 provider-staging files to PADS nodes

Michael Wilde wilde at mcs.anl.gov
Wed Jan 19 13:12:14 CST 2011


Mihael,

The following test on PADS failed/hung with an error 521 from worker.pl:

---
sub getFileCBDataInIndirect {
    ...
    elsif ($timeout) {
        queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", "Timeout staging in file"));
        delete($JOBDATA{$jobid});
---

The test was a single foreach loop running 1,000 "mv" commands.

The throttle was 200 jobs to this coaster pool (one 4-node, 32-core PBS job):

  <pool handle="localhost" sysinfo="INTEL32::LINUX">
    <execution provider="coaster" url="" jobmanager="local:pbs"/>
    <profile namespace="globus" key="workersPerNode">8</profile>
    <profile namespace="globus" key="maxTime">3500</profile>
    <profile namespace="globus" key="slots">1</profile>
    <profile namespace="globus" key="nodeGranularity">4</profile>
    <profile namespace="globus" key="maxNodes">4</profile>
    <profile namespace="globus" key="queue">short</profile>
    <profile namespace="karajan" key="jobThrottle">2.0</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="local"/>
    <workdirectory>/scratch/local/wilde/test/swiftwork</workdirectory>
    <profile namespace="swift" key="stagingMethod">file</profile>
    <scratch>/scratch/local/wilde/swiftscratch</scratch>
  </pool>
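
As a rough check on the numbers in this pool definition, here is a minimal Python sketch. The worker-slot count follows directly from maxNodes and workersPerNode above. The throttle formula (jobThrottle * 100 + 1) is an assumption about how Swift interprets jobThrottle; it is consistent with the "Submitted:201" lines in the output, but it is not confirmed anywhere in this message. The function names are made up for illustration.

```python
def worker_slots(max_nodes, workers_per_node):
    """Number of jobs the coaster block can run at once."""
    return max_nodes * workers_per_node


def throttle_limit(job_throttle):
    """Concurrent jobs allowed for the site.

    Assumption: Swift allows jobThrottle * 100 + 1 jobs per site.
    """
    return round(job_throttle * 100) + 1


# Values taken from the pool definition above.
slots = worker_slots(max_nodes=4, workers_per_node=8)
limit = throttle_limit(2.0)
print(slots, limit)
```

If the assumption holds, about 201 jobs are handed to the coaster service while only 32 can run at a time, so most submitted jobs sit queued behind the first wave.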

33 jobs finished (one full "wave" of 32 plus 1 more), and then one or more workers timed out. Note that the hang may have happened earlier, since no new jobs were starting as the jobs in the first wave were finishing.

time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps mvn.swift -n=1000 >& out &


The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net.

Swift stdout showed the following after waiting a while for a 4-node PADS coaster allocation to start:

Progress:  Selecting site:799  Submitted:201
Progress:  Selecting site:799  Submitted:201
Progress:  Selecting site:799  Submitted:200  Active:1
Progress:  Selecting site:798  Submitted:177  Active:24  Finished successfully:1
Progress:  Selecting site:796  Submitted:172  Active:28  Finished successfully:4
Progress:  Selecting site:792  Submitted:176  Active:24  Finished successfully:8
Progress:  Selecting site:788  Submitted:180  Active:20  Finished successfully:12
Progress:  Selecting site:784  Submitted:184  Active:16  Finished successfully:16
Progress:  Selecting site:780  Submitted:188  Active:12  Finished successfully:20
Progress:  Selecting site:777  Submitted:191  Active:9  Finished successfully:23
Progress:  Selecting site:773  Submitted:195  Active:5  Finished successfully:27
Progress:  Selecting site:770  Submitted:197  Active:3  Finished successfully:30
Progress:  Selecting site:767  Submitted:200  Finished successfully:33
Progress:  Selecting site:766  Submitted:201  Finished successfully:33
Progress:  Selecting site:766  Submitted:201  Finished successfully:33
Progress:  Selecting site:766  Submitted:201  Finished successfully:33
Progress:  Selecting site:766  Submitted:201  Finished successfully:33
Progress:  Selecting site:766  Submitted:201  Finished successfully:33
Progress:  Selecting site:766  Submitted:201  Finished successfully:33
Progress:  Selecting site:766  Submitted:200  Active:1  Finished successfully:33
Execution failed:
        Job failed with an exit code of 521
login1$ 
login1$ 
login1$ pwd
/scratch/local/wilde/lab
login1$ ls -lt | head
total 51408
-rw-r--r-- 1 wilde ci-users  5043350 Jan 19 10:51 mvn-20110119-0956-s3s8h9h2.log

(copied to ~wilde)
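
To find where the run stopped making progress, the "Progress:" lines quoted above can be parsed mechanically. The following is a minimal Python sketch, assuming the line format is exactly as shown in this output (state name, colon, count, separated by spaces). The helper names parse_progress and trailing_stall are hypothetical and not part of Swift.

```python
import re


def parse_progress(line):
    """Turn one 'Progress:' line into a dict of state name -> job count.

    Returns None for lines that are not progress lines.
    """
    prefix = "Progress:"
    if not line.startswith(prefix):
        return None
    pairs = re.findall(r"([A-Za-z ]+?):(\d+)", line[len(prefix):])
    return {name.strip(): int(count) for name, count in pairs}


def trailing_stall(lines):
    """Return (finished, repeats): the last 'Finished successfully' count
    and how many consecutive final progress lines reported that count."""
    counts = []
    for line in lines:
        state = parse_progress(line)
        if state is not None:
            counts.append(state.get("Finished successfully", 0))
    if not counts:
        return (0, 0)
    last = counts[-1]
    repeats = 0
    for count in reversed(counts):
        if count != last:
            break
        repeats += 1
    return (last, repeats)
```

Run over the stdout above, this would report the finished count stuck at 33 for the final run of progress lines before the failure.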

The script was:

login1$ cat mvn.swift
type file;

app (file o) mv (file i)
{
  mv @i @o;
}

file out[]<simple_mapper; location="outdir", prefix="f.",suffix=".out">;
foreach j in [1:@toint(@arg("n","1"))] {
  file data<"data.txt">;
  out[j] = mv(data);
}


data.txt was 3MB

A look at the outdir gives a clue to where things hung. The files of about 3MB or less, timestamped 10:48, are from this job. Files from 10:39 and earlier are from other manual runs executed on login1. Note that 3 of the output files from this job have length 0 or less than 3MB, and were likely in transit back from the worker:

-rw-r--r-- 1 wilde ci-users  2686976 Jan 19 10:48 f.0125.out
-rw-r--r-- 1 wilde ci-users  2621440 Jan 19 10:48 f.0167.out
-rw-r--r-- 1 wilde ci-users        0 Jan 19 10:48 f.0259.out
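
The same check can be scripted rather than read off the listing. This is a minimal Python sketch that flags output files shorter than a complete one. It assumes a complete output is 3010301 bytes, the size shared by the full-length 10:48 files in the listing, and that outputs follow the f.NNNN.out naming from the mapper. The function names are hypothetical.

```python
import os

# Assumption: size in bytes of a complete output, taken from the listing.
EXPECTED_SIZE = 3010301


def find_incomplete(sizes, expected=EXPECTED_SIZE):
    """Return the sorted names of files shorter than the expected size."""
    return sorted(name for name, size in sizes.items() if size < expected)


def scan_sizes(directory, prefix="f.", suffix=".out"):
    """Map each matching regular file in the directory to its size in bytes."""
    sizes = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            name = entry.name
            if entry.is_file() and name.startswith(prefix) and name.endswith(suffix):
                sizes[name] = entry.stat().st_size
    return sizes
```

Calling find_incomplete(scan_sizes("outdir")) from the lab directory would list the truncated files. The larger files left over from the earlier manual runs are not flagged, since only files smaller than the expected size are reported.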


login1$ pwd
/scratch/local/wilde/lab
login1$ cd outdir
login1$ ls -lt | head -40
total 2772188
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0023.out
-rw-r--r-- 1 wilde ci-users  2686976 Jan 19 10:48 f.0125.out
-rw-r--r-- 1 wilde ci-users  2621440 Jan 19 10:48 f.0167.out
-rw-r--r-- 1 wilde ci-users        0 Jan 19 10:48 f.0259.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0336.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0380.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0015.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0204.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0379.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0066.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0221.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0281.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0403.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0142.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0187.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0067.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0081.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0134.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0136.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0146.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0254.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0362.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0312.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0370.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0389.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0027.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0094.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0183.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0363.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0016.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0025.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0429.out
-rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0239.out
-rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out
-rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out
-rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out
-rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out
-rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out
-rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out

- Mike


-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
