[Swift-devel] Error 521 provider-staging files to PADS nodes

Mihael Hategan hategan at mcs.anl.gov
Mon Jan 24 18:54:11 CST 2011


And then here's the funny thing:
2 workers, 4 wpn.
When running with ppn=2 (so both on the same node):
[IN]: Total transferred: 7.99 GB, current rate: 13.07 MB/s, average
rate: 85.23 MB/s
[OUT] Total transferred: 8 GB, current rate: 42 B/s, average rate: 85.38
MB/s

Same situation, but with ppn=1 (so the two are on different nodes):
[IN]: Total transferred: 5.83 GB, current rate: 20.79 MB/s, average
rate: 20.31 MB/s
[OUT] Total transferred: 5.97 GB, current rate: 32.01 MB/s, average
rate: 20.8 MB/s

This, to me, looks odd, because it's the opposite of what I'm expecting.
The service itself should see no difference between the two, and I
suspect it doesn't. But something else is going on. Any ideas?

Mihael


On Mon, 2011-01-24 at 01:41 -0800, Mihael Hategan wrote:
> Play with buffer sizes and ye shall be rewarded.
> 
> Turns out that setting TCP buffer sizes to obscene numbers, like 2M,
> gives you quite a bit: 70MB/s in + 70MB/s out on average. Those pads
> nodes must have some fast disks (though maybe it's just the cache).
> 
> This is with 1 worker and 4 wpn. I'm assuming that with many workers, the
> fact that each worker connection has its separate buffer will
> essentially achieve a similar effect. But then there should be an option
> for setting the buffer size.
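> 
> An option like that would boil down to a couple of setsockopt calls on
> the worker side. A minimal sketch in Perl (illustrative only; the host,
> port, and variable names are placeholders, not worker.pl's actual code):
> 
>     use strict;
>     use warnings;
>     use IO::Socket::INET;
>     use Socket qw(SOL_SOCKET SO_SNDBUF SO_RCVBUF);
> 
>     # Hypothetical connection from a worker back to the coaster service.
>     my $sock = IO::Socket::INET->new(
>         PeerAddr => 'headnode.example',
>         PeerPort => 50000,
>         Proto    => 'tcp',
>     ) or die "connect: $!";
> 
>     # 2M buffers, the value that gave ~70MB/s each way above. For the
>     # receive window to benefit fully, this would ideally be done on the
>     # unconnected socket, before the TCP handshake.
>     my $bufsz = 2 * 1024 * 1024;
>     setsockopt($sock, SOL_SOCKET, SO_SNDBUF, $bufsz) or warn "SO_SNDBUF: $!";
>     setsockopt($sock, SOL_SOCKET, SO_RCVBUF, $bufsz) or warn "SO_RCVBUF: $!";
> 
>     # Report what the kernel actually granted (Linux typically doubles the
>     # requested value and caps it at net.core.rmem_max/wmem_max).
>     printf "sndbuf=%d rcvbuf=%d\n",
>         unpack('i', getsockopt($sock, SOL_SOCKET, SO_SNDBUF)),
>         unpack('i', getsockopt($sock, SOL_SOCKET, SO_RCVBUF));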
> 
> The numbers are attached. This all goes from head node local disk to
> worker node local disk directly, so there is no nfs. I'd be curious to
> know how that compares, but I am done for the day.
> 
> Mihael
> 
> On Sun, 2011-01-23 at 13:32 -0800, Mihael Hategan wrote:
> > I'm trying to run tests on pads. The queues aren't quite empty. In the
> > mean time, I committed a bit of a patch to trunk to measure aggregate
> > traffic on TCP channels (those are only used by the workers). You can
> > enable it by setting the "tcp.channel.log.io.performance" system
> > property to "true".
> > 
> > Mihael
> > 
> > On Wed, 2011-01-19 at 13:12 -0800, Mihael Hategan wrote:
> > > might be due to one of the recent patches.
> > > 
> > > you could try to set IOBLOCKSZ to 1 in worker.pl and rerun.
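> > > 
> > > (IOBLOCKSZ appears to control how many I/O buffers worker.pl handles
> > > per loop iteration; the exact declaration and default vary by version,
> > > but the suggested change is roughly:
> > > 
> > >     # Handle one buffer per I/O loop iteration instead of a batch;
> > >     # useful for checking whether the recent I/O changes are involved
> > >     # in the hang.
> > >     use constant IOBLOCKSZ => 1;
> > > 
> > > This is illustrative; check worker.pl itself for the actual form.)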
> > > 
> > > On Wed, 2011-01-19 at 13:37 -0600, Michael Wilde wrote:
> > > > An interesting observation on the returned output files: there are exactly 33 files in the output dir from this run, the same as the number of jobs Swift reports as Finished successfully. But of those 33, the last 4 are only of partial length, and one of the 4 is length zero (see below).
> > > > 
> > > > It's surprising (and perhaps a bug) that the jobs are reported finished before the output file is fully written?
> > > > 
> > > > Also, these 3 partial files plus 1 zero-length file look to me like one worker staging op hung (the oldest of the 4 incomplete output files), and then perhaps the other 3 were cut short when the coaster service data protocol froze?
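> > > > 
> > > > A quick way to flag the incomplete stage-outs (assuming, as in this
> > > > run, that every complete output file should be exactly 3010301 bytes;
> > > > this is an ad-hoc check, not anything from the Swift tree):
> > > > 
> > > >     use strict;
> > > >     use warnings;
> > > > 
> > > >     my $expected = 3010301;    # size of a fully staged-out f.NNNN.out
> > > >     my $dir      = 'outdir';
> > > > 
> > > >     opendir(my $dh, $dir) or die "opendir $dir: $!";
> > > >     for my $f (sort grep { /^f\.\d+\.out$/ } readdir $dh) {
> > > >         my $sz = -s "$dir/$f";
> > > >         printf "%-12s %8d  INCOMPLETE\n", $f, $sz if $sz != $expected;
> > > >     }
> > > >     closedir $dh;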
> > > > 
> > > > - Mike
> > > > 
> > > > login1$ pwd
> > > > /scratch/local/wilde/lab
> > > > login1$ cd outdir
> > > > login1$ ls -lt | grep 10:48      
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0023.out
> > > > -rw-r--r-- 1 wilde ci-users  2686976 Jan 19 10:48 f.0125.out
> > > > -rw-r--r-- 1 wilde ci-users  2621440 Jan 19 10:48 f.0167.out
> > > > -rw-r--r-- 1 wilde ci-users        0 Jan 19 10:48 f.0259.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0336.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0380.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0015.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0204.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0379.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0066.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0221.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0281.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0403.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0142.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0187.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0067.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0081.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0134.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0136.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0146.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0254.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0362.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0312.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0370.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0389.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0027.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0094.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0183.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0363.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0016.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0025.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0429.out
> > > > -rw-r--r-- 1 wilde ci-users  3010301 Jan 19 10:48 f.0239.out
> > > > login1$ ls -lt | grep 10:48 | wc -l
> > > > 33
> > > > login1$ 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > ----- Original Message -----
> > > > > Mihael,
> > > > > 
> > > > > The following test on pads failed/hung with an error 521 from
> > > > > worker.pl:
> > > > > 
> > > > > ---
> > > > > sub getFileCBDataInIndirect {
> > > > >     ...
> > > > >     elsif ($timeout) {
> > > > >         queueCmd((nullCB(), "JOBSTATUS", $jobid, FAILED, "521",
> > > > >                  "Timeout staging in file"));
> > > > >         delete($JOBDATA{$jobid});
> > > > > ---
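> > > > > 
> > > > > For context: whatever the exact trigger in worker.pl, this branch is
> > > > > the tail end of a standard stage-in timeout pattern, i.e. track when
> > > > > data last arrived for a job and fail it with code 521 once the
> > > > > transfer has been silent too long. A generic sketch, with placeholder
> > > > > names and timeout value, not worker.pl's actual code:
> > > > > 
> > > > >     use strict;
> > > > >     use warnings;
> > > > > 
> > > > >     my $IDLE_TIMEOUT = 60;     # seconds; placeholder value
> > > > >     my %last_activity;         # jobid -> time data last arrived
> > > > > 
> > > > >     sub noteStageinProgress {
> > > > >         my ($jobid) = @_;
> > > > >         $last_activity{$jobid} = time();
> > > > >     }
> > > > > 
> > > > >     sub checkStageinTimeouts {
> > > > >         for my $jobid (keys %last_activity) {
> > > > >             if (time() - $last_activity{$jobid} > $IDLE_TIMEOUT) {
> > > > >                 # worker.pl's equivalent of this branch queues the
> > > > >                 # FAILED/"521" JOBSTATUS shown above.
> > > > >                 warn "job $jobid: timeout staging in file (521)\n";
> > > > >                 delete $last_activity{$jobid};
> > > > >             }
> > > > >         }
> > > > >     }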
> > > > > 
> > > > > single foreach loop, doing 1,000 "mv" commands
> > > > > 
> > > > > throttle was 200 jobs to this coaster pool (1 4-node 32-core PBS job):
> > > > > 
> > > > > <pool handle="localhost" sysinfo="INTEL32::LINUX">
> > > > > <execution provider="coaster" url="" jobmanager="local:pbs"/>
> > > > > <profile namespace="globus" key="workersPerNode">8</profile>
> > > > > <profile namespace="globus" key="maxTime">3500</profile>
> > > > > <profile namespace="globus" key="slots">1</profile>
> > > > > <profile namespace="globus" key="nodeGranularity">4</profile>
> > > > > <profile namespace="globus" key="maxNodes">4</profile>
> > > > > <profile namespace="globus" key="queue">short</profile>
> > > > > <profile namespace="karajan" key="jobThrottle">2.0</profile>
> > > > > <profile namespace="karajan" key="initialScore">10000</profile>
> > > > > <filesystem provider="local"/>
> > > > > <workdirectory>/scratch/local/wilde/test/swiftwork</workdirectory>
> > > > > <profile namespace="swift" key="stagingMethod">file</profile>
> > > > > <scratch>/scratch/local/wilde/swiftscratch</scratch>
> > > > > </pool>
> > > > > 
> > > > > Ran 33 jobs (one "wave" of 32 plus 1) and then one or more workers
> > > > > timed out. Note that the hang may have happened earlier, as no new
> > > > > jobs were starting while the jobs in the first wave were finishing.
> > > > > 
> > > > > time swift -tc.file tc -sites.file pbscoasters.xml -config cf.ps
> > > > > mvn.swift -n=1000 >& out &
> > > > > 
> > > > > 
> > > > > The log is in ~wilde/mvn-20110119-0956-s3s8h9h2.log on CI net.
> > > > > 
> > > > > Swift stdout showed the following after waiting a while for a 4-node
> > > > > PADS coaster allocation to start:
> > > > > 
> > > > > Progress: Selecting site:799 Submitted:201
> > > > > Progress: Selecting site:799 Submitted:201
> > > > > Progress: Selecting site:799 Submitted:200 Active:1
> > > > > Progress: Selecting site:798 Submitted:177 Active:24 Finished
> > > > > successfully:1
> > > > > Progress: Selecting site:796 Submitted:172 Active:28 Finished
> > > > > successfully:4
> > > > > Progress: Selecting site:792 Submitted:176 Active:24 Finished
> > > > > successfully:8
> > > > > Progress: Selecting site:788 Submitted:180 Active:20 Finished
> > > > > successfully:12
> > > > > Progress: Selecting site:784 Submitted:184 Active:16 Finished
> > > > > successfully:16
> > > > > Progress: Selecting site:780 Submitted:188 Active:12 Finished
> > > > > successfully:20
> > > > > Progress: Selecting site:777 Submitted:191 Active:9 Finished
> > > > > successfully:23
> > > > > Progress: Selecting site:773 Submitted:195 Active:5 Finished
> > > > > successfully:27
> > > > > Progress: Selecting site:770 Submitted:197 Active:3 Finished
> > > > > successfully:30
> > > > > Progress: Selecting site:767 Submitted:200 Finished successfully:33
> > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > > > Progress: Selecting site:766 Submitted:201 Finished successfully:33
> > > > > Progress: Selecting site:766 Submitted:200 Active:1 Finished
> > > > > successfully:33
> > > > > Execution failed:
> > > > > Job failed with an exit code of 521
> > > > > login1$
> > > > > login1$
> > > > > login1$ pwd
> > > > > /scratch/local/wilde/lab
> > > > > login1$ ls -lt | head
> > > > > total 51408
> > > > > -rw-r--r-- 1 wilde ci-users 5043350 Jan 19 10:51
> > > > > mvn-20110119-0956-s3s8h9h2.log
> > > > > 
> > > > > (copied to ~wilde)
> > > > > 
> > > > > script was:
> > > > > 
> > > > > login1$ cat mvn.swift
> > > > > type file;
> > > > > 
> > > > > app (file o) mv (file i)
> > > > > {
> > > > >     mv @i @o;
> > > > > }
> > > > > 
> > > > > file out[]<simple_mapper; location="outdir",
> > > > >     prefix="f.", suffix=".out">;
> > > > > foreach j in [1:@toint(@arg("n","1"))] {
> > > > >     file data<"data.txt">;
> > > > >     out[j] = mv(data);
> > > > > }
> > > > > 
> > > > > 
> > > > > data.txt was 3MB
> > > > > 
> > > > > A look at the outdir gives a clue to where things hung: the files of
> > > > > <= ~3MB from time 10:48 are from this job. Files from 10:39 and
> > > > > earlier are from other manual runs executed on login1. Note that 3 of
> > > > > the output files have length 0 or < 3MB and were likely in transit
> > > > > back from the worker:
> > > > > 
> > > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out
> > > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out
> > > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out
> > > > > 
> > > > > 
> > > > > login1$ pwd
> > > > > /scratch/local/wilde/lab
> > > > > login1$ cd outdir
> > > > > login1$ ls -lt | head -40
> > > > > total 2772188
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0023.out
> > > > > -rw-r--r-- 1 wilde ci-users 2686976 Jan 19 10:48 f.0125.out
> > > > > -rw-r--r-- 1 wilde ci-users 2621440 Jan 19 10:48 f.0167.out
> > > > > -rw-r--r-- 1 wilde ci-users 0 Jan 19 10:48 f.0259.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0336.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0380.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0015.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0204.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0379.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0066.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0221.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0281.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0403.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0142.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0187.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0067.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0081.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0134.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0136.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0146.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0254.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0362.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0312.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0370.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0389.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0027.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0094.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0183.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0363.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0016.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0025.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0429.out
> > > > > -rw-r--r-- 1 wilde ci-users 3010301 Jan 19 10:48 f.0239.out
> > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0024.out
> > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0037.out
> > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0001.out
> > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0042.out
> > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0033.out
> > > > > -rw-r--r-- 1 wilde ci-users 30103010 Jan 19 10:39 f.0051.out
> > > > > 
> > > > > - Mike
> > > > > 
> > > > > 
> > > > > --
> > > > > Michael Wilde
> > > > > Computation Institute, University of Chicago
> > > > > Mathematics and Computer Science Division
> > > > > Argonne National Laboratory
> > > > > 
