[Swift-devel] Error 521 provider-staging files to PADS nodes
Michael Wilde
wilde at mcs.anl.gov
Wed Jan 19 13:12:14 CST 2011
Mihael,
The following test on pads failed/hung with an error 521 from worker.pl:
---
sub getFileCBDataInIndirect {
...
elsif ($timeout) {
queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", "Timeout staging in file"));
delete($JOBDATA{$jobid});
---
single foreach loop, doing 1,000 "mv" commands
throttle was 200 jobs to this coaster pool (1 4-node 32-core PBS job):
<pool handle="localhost" sysinfo="INTEL32::LINUX">
<execution provider="coaster" url="" jobmanager="local:pbs"/>
<profile namespace="globus" key="workersPerNode">8</profile>
<profile namespace="globus" key="maxTime">3500</profile>
<profile namespace="globus" key="slots">1</profile>
<profile namespace="globus" key="nodeGranularity">4</profile>
<profile namespace="globus" key="maxNodes">4</profile>
<profile namespace="globus" key="queue">short</profile>
<profile namespace="karajan" key="jobThrottle">2.0</profile>
<profile namespace="karajan" key="initialScore">10000</profile>
<filesystem provider="local"/>
<workdirectory>/scratch/local/wilde/test/swiftwork</workdirectory>
<profile namespace="swift" key="stagingMethod">file</profile>
<scratch>/scratch/local/wilde/swiftscratch</scratch>
</pool>
Ran 33 jobs - 1 job over 1 "wave" of 32 and then one or more workers timed out. Note that the hang may have happened earlier, as no new jobs were starting as the jobs in the first wave were finishing.
time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps mvn.swift -n=1000 >& out &
The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net.
Swift stdout showed the following after waiting a while for a 4-node PADS coaster allocation to start:
Progress: Selecting site:799 Submitted:201
Progress: Selecting site:799 Submitted:201
Progress: Selecting site:799 Submitted:200 Active:1
Progress: Selecting site:798 Submitted:177 Active:24 Finished successfully:1
Progress: Selecting site:796 Submitted:172 Active:28 Finished successfully:4
Progress: Selecting site:792 Submitted:176 Active:24 Finished successfully:8
Progress: Selecting site:788 Submitted:180 Active:20 Finished successfully:12
Progress: Selecting site:784 Submitted:184 Active:16 Finished successfully:16
Progress: Selecting site:780 Submitted:188 Active:12 Finished successfully:20
Progress: Selecting site:777 Submitted:191 Active:9 Finished successfully:23
Progress: Selecting site:773 Submitted:195 Active:5 Finished successfully:27
Progress: Selecting site:770 Submitted:197 Active:3 Finished successfully:30
Progress: Selecting site:767 Submitted:200 Finished successfully:33
Progress: Selecting site:766 Submitted:201 Finished successfully:33
Progress: Selecting site:766 Submitted:201 Finished successfully:33
Progress: Selecting site:766 Submitted:201 Finished successfully:33
Progress: Selecting site:766 Submitted:201 Finished successfully:33
Progress: Selecting site:766 Submitted:201 Finished successfully:33
Progress: Selecting site:766 Submitted:201 Finished successfully:33
Progress: Selecting site:766 Submitted:200 Active:1 Finished successfully:33
Execution failed:
Job failed with an exit code of 521
login1$
login1$
login1$ pwd
/scratch/local/wilde/lab
login1$ ls -lt | head
total 51408
-rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51 mvn-20110119-0956-s3s8h9h2.log
(copied to ~wilde)
script was:
login1$ cat mvn.swift
type file;
app (file o) mv (file i)
{
mv @i @o;
}
file out[]<simple_mapper; location="outdir", prefix="f.",suffix=".out">;
foreach j in [1:@toint(@arg("n","1"))] {
file data<"data.txt">;
out[j] = mv(data);
}
data.txt was 3MB
A look at the outdir gives a clue to where things hung: The files of <= ~3MB from time 10:48 are from this job. Files from 10:39 and earlier are from other manual runs executed on login1, Note that 3 of the 3MB output files have length 0 or <3MB, and were likely in transit back from the worker:
-rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out
-rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out
-rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out
login1$ pwd
/scratch/local/wilde/lab
login1$ cd outdir
login1$ ls -lt | head -40
total 2772188
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out
-rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out
-rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out
-rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out
-rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out
-rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out
-rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out
-rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out
-rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out
-rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out
-rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out
l
- Mike
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory
More information about the Swift-devel
mailing list