[Swift-devel] Error 521 provider-staging files to PADS nodes

Mihael Hategan hategan at mcs.anl.gov
Wed Jan 19 15:12:04 CST 2011


might be due to one of the recent patches.

you could try to set IOBLOCKSZ to 1 in worker.pl and rerun.

On Wed, 2011-01-19 at 13:37 -0600, Michael Wilde wrote:
> An interesting observation on the returned output files:  there are exactly 33 files in the output dir from this run: the same as the number of jobs Swift reports as Finished successfully.   But of those 33, the last 4 are only of partial length, and one of the 4 is length zero (see below).
> 
> Its surprising and perhaps a bug that the jobs are reported finished before the output file is fully written???
> 
> Also this 3-partial plus 1-zero file looks to me like one worker staging op hung (the oldest of the 4 incomplete output files) and then perhaps 3 were cut short when the coaster service data protocol froze?
> 
> - Mike
> 
> login1$ pwd
> /scratch/local/wilde/lab
> login1$ cd outdir
> login1$ ls -lt | grep 10:48      
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0023.out
> -rw-r--r-- 1 wilde ci-users  2686976 Jan 19 10:48 f.0125.out
> -rw-r--r-- 1 wilde ci-users  2621440 Jan 19 10:48 f.0167.out
> -rw-r--r-- 1 wilde ci-users        0 Jan 19 10:48 f.0259.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0336.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0380.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0015.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0204.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0379.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0066.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0221.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0281.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0403.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0142.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0187.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0067.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0081.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0134.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0136.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0146.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0254.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0362.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0312.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0370.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0389.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0027.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0094.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0183.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0363.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0016.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0025.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0429.out
> -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0239.out
> login1$ ls -lt | grep 10:48 | wc -l
> 33
> login1$ 
> 
> 
> 
> 
> ----- Original Message -----
> > Mihael,
> > 
> > The following test on pads failed/hung with an error 521 from
> > worker.pl:
> > 
> > ---
> > sub getFileCBDataInIndirect {
> > ...
> > elsif ($timeout) {
> > queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", "Timeout
> > staging in file"));
> > delete($JOBDATA{$jobid});
> > ---
> > 
> > single foreach loop, doing 1,000 "mv" commands
> > 
> > throttle was 200 jobs to this coaster pool (1 4-node 32-core PBS job):
> > 
> > <pool handle="localhost" sysinfo="INTEL32::LINUX">
> > <execution provider="coaster" url="" jobmanager="local:pbs"/>
> > <profile namespace="globus" key="workersPerNode">8</profile>
> > <profile namespace="globus" key="maxTime">3500</profile>
> > <profile namespace="globus" key="slots">1</profile>
> > <profile namespace="globus" key="nodeGranularity">4</profile>
> > <profile namespace="globus" key="maxNodes">4</profile>
> > <profile namespace="globus" key="queue">short</profile>
> > <profile namespace="karajan" key="jobThrottle">2.0</profile>
> > <profile namespace="karajan" key="initialScore">10000</profile>
> > <filesystem provider="local"/>
> > <workdirectory>/scratch/local/wilde/test/swiftwork</workdirectory>
> > <profile namespace="swift" key="stagingMethod">file</profile>
> > <scratch>/scratch/local/wilde/swiftscratch</scratch>
> > </pool>
> > 
> > Ran 33 jobs - 1 job over 1 "wave" of 32 and then one or more workers
> > timed out. Note that the hang may have happened earlier, as no new
> > jobs were starting as the jobs in the first wave were finishing.
> > 
> > time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps
> > mvn.swift -n=1000 >& out &
> > 
> > 
> > The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net.
> > 
> > Swift stdout showed the following after waiting a while for a 4-node
> > PADS coaster allocation to start:
> > 
> > Progress: Selecting site:799 Submitted:201
> > Progress: Selecting site:799 Submitted:201
> > Progress: Selecting site:799 Submitted:200 Active:1
> > Progress: Selecting site:798 Submitted:177 Active:24 Finished
> > successfully:1
> > Progress: Selecting site:796 Submitted:172 Active:28 Finished
> > successfully:4
> > Progress: Selecting site:792 Submitted:176 Active:24 Finished
> > successfully:8
> > Progress: Selecting site:788 Submitted:180 Active:20 Finished
> > successfully:12
> > Progress: Selecting site:784 Submitted:184 Active:16 Finished
> > successfully:16
> > Progress: Selecting site:780 Submitted:188 Active:12 Finished
> > successfully:20
> > Progress: Selecting site:777 Submitted:191 Active:9 Finished
> > successfully:23
> > Progress: Selecting site:773 Submitted:195 Active:5 Finished
> > successfully:27
> > Progress: Selecting site:770 Submitted:197 Active:3 Finished
> > successfully:30
> > Progress: Selecting site:767 Submitted:200 Finished successfully:33
> > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > Progress: Selecting site:766 Submitted:200 Active:1 Finished
> > successfully:33
> > Execution failed:
> > Job failed with an exit code of 521
> > login1$
> > login1$
> > login1$ pwd
> > /scratch/local/wilde/lab
> > login1$ ls -lt | head
> > total 51408
> > -rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51
> > mvn-20110119-0956-s3s8h9h2.log
> > 
> > (copied to ~wilde)
> > 
> > script was:
> > 
> > login1$ cat mvn.swift
> > type file;
> > 
> > app (file o) mv (file i)
> > {
> > mv @i @o;
> > }
> > 
> > file out[]<simple_mapper; location="outdir",
> > prefix="f.",suffix=".out">;
> > foreach j in [1:@toint(@arg("n","1"))] {
> > file data<"data.txt">;
> > out[j] = mv(data);
> > }
> > 
> > 
> > data.txt was 3MB
> > 
> > A look at the outdir gives a clue to where things hung: The files of
> > <= ~3MB from time 10:48 are from this job. Files from 10:39 and
> > earlier are from other manual runs executed on login1, Note that 3 of
> > the 3MB output files have length 0 or <3MB, and were likely in transit
> > back from the worker:
> > 
> > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out
> > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out
> > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out
> > 
> > 
> > login1$ pwd
> > /scratch/local/wilde/lab
> > login1$ cd outdir
> > login1$ ls -lt | head -40
> > total 2772188
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out
> > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out
> > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out
> > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out
> > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out
> > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out
> > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out
> > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out
> > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out
> > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out
> > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out
> > l
> > 
> > - Mike
> > 
> > 
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 





More information about the Swift-devel mailing list