[Swift-devel] Error 521 provider-staging files to PADS nodes

Mihael Hategan hategan at mcs.anl.gov
Sun Jan 23 15:32:14 CST 2011


I'm trying to run tests on pads. The queues aren't quite empty. In the
mean time, I committed a bit of a patch to trunk to measure aggregate
traffic on TCP channels (those are only used by the workers). You can
enable it by setting the "tcp.channel.log.io.performance" system
property to "true".

Mihael

On Wed, 2011-01-19 at 13:12 -0800, Mihael Hategan wrote:
> might be due to one of the recent patches.
> 
> you could try to set IOBLOCKSZ to 1 in worker.pl and rerun.
> 
> On Wed, 2011-01-19 at 13:37 -0600, Michael Wilde wrote:
> > An interesting observation on the returned output files:  there are exactly 33 files in the output dir from this run: the same as the number of jobs Swift reports as Finished successfully.   But of those 33, the last 4 are only of partial length, and one of the 4 is length zero (see below).
> > 
> > Its surprising and perhaps a bug that the jobs are reported finished before the output file is fully written???
> > 
> > Also this 3-partial plus 1-zero file looks to me like one worker staging op hung (the oldest of the 4 incomplete output files) and then perhaps 3 were cut short when the coaster service data protocol froze?
> > 
> > - Mike
> > 
> > login1$ pwd
> > /scratch/local/wilde/lab
> > login1$ cd outdir
> > login1$ ls -lt | grep 10:48      
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0023.out
> > -rw-r--r-- 1 wilde ci-users  2686976 Jan 19 10:48 f.0125.out
> > -rw-r--r-- 1 wilde ci-users  2621440 Jan 19 10:48 f.0167.out
> > -rw-r--r-- 1 wilde ci-users        0 Jan 19 10:48 f.0259.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0336.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0380.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0015.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0204.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0379.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0066.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0221.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0281.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0403.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0142.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0187.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0067.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0081.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0134.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0136.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0146.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0254.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0362.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0312.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0370.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0389.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0027.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0094.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0183.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0363.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0016.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0025.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0429.out
> > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0239.out
> > login1$ ls -lt | grep 10:48 | wc -l
> > 33
> > login1$ 
> > 
> > 
> > 
> > 
> > ----- Original Message -----
> > > Mihael,
> > > 
> > > The following test on pads failed/hung with an error 521 from
> > > worker.pl:
> > > 
> > > ---
> > > sub getFileCBDataInIndirect {
> > > ...
> > > elsif ($timeout) {
> > > queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", "Timeout
> > > staging in file"));
> > > delete($JOBDATA{$jobid});
> > > ---
> > > 
> > > single foreach loop, doing 1,000 "mv" commands
> > > 
> > > throttle was 200 jobs to this coaster pool (1 4-node 32-core PBS job):
> > > 
> > > <pool handle="localhost" sysinfo="INTEL32::LINUX">
> > > <execution provider="coaster" url="" jobmanager="local:pbs"/>
> > > <profile namespace="globus" key="workersPerNode">8</profile>
> > > <profile namespace="globus" key="maxTime">3500</profile>
> > > <profile namespace="globus" key="slots">1</profile>
> > > <profile namespace="globus" key="nodeGranularity">4</profile>
> > > <profile namespace="globus" key="maxNodes">4</profile>
> > > <profile namespace="globus" key="queue">short</profile>
> > > <profile namespace="karajan" key="jobThrottle">2.0</profile>
> > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > <filesystem provider="local"/>
> > > <workdirectory>/scratch/local/wilde/test/swiftwork</workdirectory>
> > > <profile namespace="swift" key="stagingMethod">file</profile>
> > > <scratch>/scratch/local/wilde/swiftscratch</scratch>
> > > </pool>
> > > 
> > > Ran 33 jobs - 1 job over 1 "wave" of 32 and then one or more workers
> > > timed out. Note that the hang may have happened earlier, as no new
> > > jobs were starting as the jobs in the first wave were finishing.
> > > 
> > > time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps
> > > mvn.swift -n=1000 >& out &
> > > 
> > > 
> > > The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net.
> > > 
> > > Swift stdout showed the following after waiting a while for a 4-node
> > > PADS coaster allocation to start:
> > > 
> > > Progress: Selecting site:799 Submitted:201
> > > Progress: Selecting site:799 Submitted:201
> > > Progress: Selecting site:799 Submitted:200 Active:1
> > > Progress: Selecting site:798 Submitted:177 Active:24 Finished
> > > successfully:1
> > > Progress: Selecting site:796 Submitted:172 Active:28 Finished
> > > successfully:4
> > > Progress: Selecting site:792 Submitted:176 Active:24 Finished
> > > successfully:8
> > > Progress: Selecting site:788 Submitted:180 Active:20 Finished
> > > successfully:12
> > > Progress: Selecting site:784 Submitted:184 Active:16 Finished
> > > successfully:16
> > > Progress: Selecting site:780 Submitted:188 Active:12 Finished
> > > successfully:20
> > > Progress: Selecting site:777 Submitted:191 Active:9 Finished
> > > successfully:23
> > > Progress: Selecting site:773 Submitted:195 Active:5 Finished
> > > successfully:27
> > > Progress: Selecting site:770 Submitted:197 Active:3 Finished
> > > successfully:30
> > > Progress: Selecting site:767 Submitted:200 Finished successfully:33
> > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > Progress: Selecting site:766 Submitted:200 Active:1 Finished
> > > successfully:33
> > > Execution failed:
> > > Job failed with an exit code of 521
> > > login1$
> > > login1$
> > > login1$ pwd
> > > /scratch/local/wilde/lab
> > > login1$ ls -lt | head
> > > total 51408
> > > -rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51
> > > mvn-20110119-0956-s3s8h9h2.log
> > > 
> > > (copied to ~wilde)
> > > 
> > > script was:
> > > 
> > > login1$ cat mvn.swift
> > > type file;
> > > 
> > > app (file o) mv (file i)
> > > {
> > > mv @i @o;
> > > }
> > > 
> > > file out[]<simple_mapper; location="outdir",
> > > prefix="f.",suffix=".out">;
> > > foreach j in [1:@toint(@arg("n","1"))] {
> > > file data<"data.txt">;
> > > out[j] = mv(data);
> > > }
> > > 
> > > 
> > > data.txt was 3MB
> > > 
> > > A look at the outdir gives a clue to where things hung: The files of
> > > <= ~3MB from time 10:48 are from this job. Files from 10:39 and
> > > earlier are from other manual runs executed on login1, Note that 3 of
> > > the 3MB output files have length 0 or <3MB, and were likely in transit
> > > back from the worker:
> > > 
> > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out
> > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out
> > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out
> > > 
> > > 
> > > login1$ pwd
> > > /scratch/local/wilde/lab
> > > login1$ cd outdir
> > > login1$ ls -lt | head -40
> > > total 2772188
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out
> > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out
> > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out
> > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out
> > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out
> > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out
> > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out
> > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out
> > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out
> > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out
> > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out
> > > l
> > > 
> > > - Mike
> > > 
> > > 
> > > --
> > > Michael Wilde
> > > Computation Institute, University of Chicago
> > > Mathematics and Computer Science Division
> > > Argonne National Laboratory
> > > 
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel at ci.uchicago.edu
> > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > 
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel





More information about the Swift-devel mailing list