[Swift-devel] Error 521 provider-staging files to PADS nodes

Mihael Hategan hategan at mcs.anl.gov
Mon Jan 24 03:41:28 CST 2011


Play with buffer sizes and ye shall be rewarded.

Turns out that setting TCP buffer sizes to obscene numbers, like 2M,
gives you quite a bit: 70MB/s in + 70MB/s out on average. Those pads
nodes must have some fast disks (though maybe it's just the cache).

This is with 1 worker and 4wpn. I'm assuming that with many workers, the
fact that each worker connection has its separate buffer will
essentially achieve a similar effect. But then there should be an option
for setting the buffer size.

The numbers are attached. This all goes from head node local disk to
worker node local disk directly, so there is no nfs. I'd be curious to
know how that compares, but I am done for the day.

Mihael

On Sun, 2011-01-23 at 13:32 -0800, Mihael Hategan wrote:
> I'm trying to run tests on pads. The queues aren't quite empty. In the
> mean time, I committed a bit of a patch to trunk to measure aggregate
> traffic on TCP channels (those are only used by the workers). You can
> enable it by setting the "tcp.channel.log.io.performance" system
> property to "true".
> 
> Mihael
> 
> On Wed, 2011-01-19 at 13:12 -0800, Mihael Hategan wrote:
> > might be due to one of the recent patches.
> > 
> > you could try to set IOBLOCKSZ to 1 in worker.pl and rerun.
> > 
> > On Wed, 2011-01-19 at 13:37 -0600, Michael Wilde wrote:
> > > An interesting observation on the returned output files:  there are exactly 33 files in the output dir from this run: the same as the number of jobs Swift reports as Finished successfully.   But of those 33, the last 4 are only of partial length, and one of the 4 is length zero (see below).
> > > 
> > > Its surprising and perhaps a bug that the jobs are reported finished before the output file is fully written???
> > > 
> > > Also this 3-partial plus 1-zero file looks to me like one worker staging op hung (the oldest of the 4 incomplete output files) and then perhaps 3 were cut short when the coaster service data protocol froze?
> > > 
> > > - Mike
> > > 
> > > login1$ pwd
> > > /scratch/local/wilde/lab
> > > login1$ cd outdir
> > > login1$ ls -lt | grep 10:48      
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0023.out
> > > -rw-r--r-- 1 wilde ci-users  2686976 Jan 19 10:48 f.0125.out
> > > -rw-r--r-- 1 wilde ci-users  2621440 Jan 19 10:48 f.0167.out
> > > -rw-r--r-- 1 wilde ci-users        0 Jan 19 10:48 f.0259.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0336.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0380.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0015.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0204.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0379.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0066.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0221.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0281.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0403.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0142.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0187.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0067.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0081.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0134.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0136.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0146.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0254.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0362.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0312.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0370.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0389.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0027.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0094.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0183.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0363.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0016.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0025.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0429.out
> > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0239.out
> > > login1$ ls -lt | grep 10:48 | wc -l
> > > 33
> > > login1$ 
> > > 
> > > 
> > > 
> > > 
> > > ----- Original Message -----
> > > > Mihael,
> > > > 
> > > > The following test on pads failed/hung with an error 521 from
> > > > worker.pl:
> > > > 
> > > > ---
> > > > sub getFileCBDataInIndirect {
> > > > ...
> > > > elsif ($timeout) {
> > > > queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521", "Timeout
> > > > staging in file"));
> > > > delete($JOBDATA{$jobid});
> > > > ---
> > > > 
> > > > single foreach loop, doing 1,000 "mv" commands
> > > > 
> > > > throttle was 200 jobs to this coaster pool (1 4-node 32-core PBS job):
> > > > 
> > > > <pool handle="localhost" sysinfo="INTEL32::LINUX">
> > > > <execution provider="coaster" url="" jobmanager="local:pbs"/>
> > > > <profile namespace="globus" key="workersPerNode">8</profile>
> > > > <profile namespace="globus" key="maxTime">3500</profile>
> > > > <profile namespace="globus" key="slots">1</profile>
> > > > <profile namespace="globus" key="nodeGranularity">4</profile>
> > > > <profile namespace="globus" key="maxNodes">4</profile>
> > > > <profile namespace="globus" key="queue">short</profile>
> > > > <profile namespace="karajan" key="jobThrottle">2.0</profile>
> > > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > > <filesystem provider="local"/>
> > > > <workdirectory>/scratch/local/wilde/test/swiftwork</workdirectory>
> > > > <profile namespace="swift" key="stagingMethod">file</profile>
> > > > <scratch>/scratch/local/wilde/swiftscratch</scratch>
> > > > </pool>
> > > > 
> > > > Ran 33 jobs - 1 job over 1 "wave" of 32 and then one or more workers
> > > > timed out. Note that the hang may have happened earlier, as no new
> > > > jobs were starting as the jobs in the first wave were finishing.
> > > > 
> > > > time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps
> > > > mvn.swift -n=1000 >& out &
> > > > 
> > > > 
> > > > The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net.
> > > > 
> > > > Swift stdout showed the following after waiting a while for a 4-node
> > > > PADS coaster allocation to start:
> > > > 
> > > > Progress: Selecting site:799 Submitted:201
> > > > Progress: Selecting site:799 Submitted:201
> > > > Progress: Selecting site:799 Submitted:200 Active:1
> > > > Progress: Selecting site:798 Submitted:177 Active:24 Finished
> > > > successfully:1
> > > > Progress: Selecting site:796 Submitted:172 Active:28 Finished
> > > > successfully:4
> > > > Progress: Selecting site:792 Submitted:176 Active:24 Finished
> > > > successfully:8
> > > > Progress: Selecting site:788 Submitted:180 Active:20 Finished
> > > > successfully:12
> > > > Progress: Selecting site:784 Submitted:184 Active:16 Finished
> > > > successfully:16
> > > > Progress: Selecting site:780 Submitted:188 Active:12 Finished
> > > > successfully:20
> > > > Progress: Selecting site:777 Submitted:191 Active:9 Finished
> > > > successfully:23
> > > > Progress: Selecting site:773 Submitted:195 Active:5 Finished
> > > > successfully:27
> > > > Progress: Selecting site:770 Submitted:197 Active:3 Finished
> > > > successfully:30
> > > > Progress: Selecting site:767 Submitted:200 Finished successfully:33
> > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > > Progress: Selecting site:766 Submitted:200 Active:1 Finished
> > > > successfully:33
> > > > Execution failed:
> > > > Job failed with an exit code of 521
> > > > login1$
> > > > login1$
> > > > login1$ pwd
> > > > /scratch/local/wilde/lab
> > > > login1$ ls -lt | head
> > > > total 51408
> > > > -rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51
> > > > mvn-20110119-0956-s3s8h9h2.log
> > > > 
> > > > (copied to ~wilde)
> > > > 
> > > > script was:
> > > > 
> > > > login1$ cat mvn.swift
> > > > type file;
> > > > 
> > > > app (file o) mv (file i)
> > > > {
> > > > mv @i @o;
> > > > }
> > > > 
> > > > file out[]<simple_mapper; location="outdir",
> > > > prefix="f.",suffix=".out">;
> > > > foreach j in [1:@toint(@arg("n","1"))] {
> > > > file data<"data.txt">;
> > > > out[j] = mv(data);
> > > > }
> > > > 
> > > > 
> > > > data.txt was 3MB
> > > > 
> > > > A look at the outdir gives a clue to where things hung: The files of
> > > > <= ~3MB from time 10:48 are from this job. Files from 10:39 and
> > > > earlier are from other manual runs executed on login1, Note that 3 of
> > > > the 3MB output files have length 0 or <3MB, and were likely in transit
> > > > back from the worker:
> > > > 
> > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out
> > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out
> > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out
> > > > 
> > > > 
> > > > login1$ pwd
> > > > /scratch/local/wilde/lab
> > > > login1$ cd outdir
> > > > login1$ ls -lt | head -40
> > > > total 2772188
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out
> > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out
> > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out
> > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out
> > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out
> > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out
> > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out
> > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out
> > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out
> > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out
> > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out
> > > > l
> > > > 
> > > > - Mike
> > > > 
> > > > 
> > > > --
> > > > Michael Wilde
> > > > Computation Institute, University of Chicago
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > > 
> > > > _______________________________________________
> > > > Swift-devel mailing list
> > > > Swift-devel at ci.uchicago.edu
> > > > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> > > 
> > 
> > 
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel

-------------- next part --------------
PADS:
1m, cat, file, 3-4:
Progress:  Submitted:2  Active:2  Finished successfully:252
[IN]: Total transferred: 255.43 MB, current rate: 5.07 MB/s, average rate: 6.39 MB/s
[MEM] Heap total: 39.62 MMB, Heap used: 33.19 MMB
[OUT] Total transferred: 257.87 MB, current rate: 1.42 MB/s, average rate: 6.45 MB/s


32m, cat, file, 1-4:
Progress:  Selecting site:209  Stage in:1  Submitted:6  Active:2  Finished successfully:38
[IN]: Total transferred: 1.19 GB, current rate: 2.01 MB/s, average rate: 2.07 MB/s
[MEM] Heap total: 27.81 MMB, Heap used: 23.84 MMB
[OUT] Total transferred: 1.26 GB, current rate: 1 MB/s, average rate: 2.19 MB/s

32m, cat, file, 1-12:
Progress:  Selecting site:205  Stage in:1  Submitted:5  Active:3  Finished successfully:42
[IN]: Total transferred: 1.32 GB, current rate: 2.02 MB/s, average rate: 2.01 MB/s
[MEM] Heap total: 45 MMB, Heap used: 28.49 MMB
[OUT] Total transferred: 1.43 GB, current rate: 672.25 KB/s, average rate: 2.18 MB/s

32m, cat, file, 1-4, ppn=2:
Progress:  Selecting site:222  Submitted:3  Active:6  Finished successfully:25
[IN]: Total transferred: 860.34 MB, current rate: 4.3 MB/s, average rate: 4.14 MB/s
[MEM] Heap total: 87.25 MMB, Heap used: 66.65 MMB
[OUT] Total transferred: 1018.62 MB, current rate: 1.07 MB/s, average rate: 4.9 MB/s

32m, cat, file, 1-4, ppn=1, bufsz=2*32768:
Progress:  Selecting site:218  Stage in:1  Submitted:8  Active:1  Finished successfully:28
[IN]: Total transferred: 928.4 MB, current rate: 3.63 MB/s, average rate: 4.3 MB/s
[MEM] Heap total: 49.38 MMB, Heap used: 21.04 MMB
[OUT] Total transferred: 1006.27 MB, current rate: 29.05 MB/s, average rate: 4.66 MB/s

32m, cat, file, 1-4, ppn=1, bufsz=4*32768:
Progress:  Selecting site:212  Submitted:8  Finished successfully:36
[IN]: Total transferred: 1.13 GB, current rate: 6.88 MB/s, average rate: 7.53 MB/s
[MEM] Heap total: 33.75 MMB, Heap used: 27.72 MMB
[OUT] Total transferred: 1.2 GB, current rate: 32.51 MB/s, average rate: 8.01 MB/s

32m, cat, file, 1-4, ppn=1, bufsz=8*32768:
Progress:  Selecting site:131  Submitted:5  Active:2  Finished successfully:118
[IN]: Total transferred: 3.75 GB, current rate: 17.63 MB/s, average rate: 17.62 MB/s
[MEM] Heap total: 49.06 MMB, Heap used: 31.45 MMB
[OUT] Total transferred: 3.78 GB, current rate: 25.04 MB/s, average rate: 17.77 MB/s

32m, cat, file, 1-4, ppn=1, bufsz=16*32768:
Progress:  Selecting site:125  Submitted:7  Active:1  Finished successfully:123
[IN]: Total transferred: 3.88 GB, current rate: 31.54 MB/s, average rate: 32.81 MB/s
[MEM] Heap total: 66.06 MMB, Heap used: 52.44 MMB
[OUT] Total transferred: 3.94 GB, current rate: 41.39 MB/s, average rate: 33.34 MB/s

32m, cat, file, 1-4, ppn=1, bufsz=32*32768:
Progress:  Selecting site:55  Submitted:7  Active:1  Finished successfully:193
[IN]: Total transferred: 6.05 GB, current rate: 61.8 MB/s, average rate: 58.49 MB/s
[MEM] Heap total: 88.5 MMB, Heap used: 31.79 MMB
[OUT] Total transferred: 6.12 GB, current rate: 64.16 MB/s, average rate: 59.08 MB/s

32m, cat, file, 1-4, ppn=1, bufsz=64*32768:
Progress:  Submitted:3  Active:2  Finished successfully:251
[IN]: Total transferred: 7.93 GB, current rate: 86.94 MB/s, average rate: 70.58 MB/s
[MEM] Heap total: 122.75 MMB, Heap used: 66.48 MMB
[OUT] Total transferred: 7.99 GB, current rate: 66.47 MB/s, average rate: 71.11 MB/s



More information about the Swift-devel mailing list