[Swift-devel] Swift and BGP

Ioan Raicu iraicu at cs.uchicago.edu
Mon Oct 26 16:36:15 CDT 2009



Mihael Hategan wrote:
> I've been playing with Swift on the BGP the past few days.
>
> My observation is that with the current Swift and a reasonable mapping
> (such as the one done by the concurrent mapper), filesystem slowness
> does not seem to be caused by contentious access to the same
> file/directory. Instead, it's the sheer volume of network filesystem
> requests, which come from a few sources:
>
> - bash: every time bash forks a process, it closes the current script
> file and forks the process; after the process is done, bash re-opens
> the script file, seeks to the position it left off at, and reads the
> next command. A typical script forks a few processes, and our wrapper
> script is invoked from a networked FS.
> - the info file: about 30 requests per run
> - Swift-specific fs access (what the wrapper is meant to do)
> - application fs requests
>
> At 100 jobs/s, the wrapper alone causes about 10,000 requests/s to the
> fs server.
>   
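To put a rough number on the wrapper-related requests described above,
one can count the filesystem-related syscalls that a single bash
invocation generates. A minimal sketch with strace, run against a
stand-in script rather than the real wrapper, on any node where strace
is available:

  # Stand-in for the wrapper: a bash script that forks a few external
  # processes, the way the real wrapper does.
  printf '#!/bin/bash\n/bin/true\n/bin/true\n/bin/true\n' > /tmp/fake-wrapper.sh

  # Count filesystem-related syscalls across bash and its children
  # (-f follows forks, -c prints a per-syscall summary). Per the
  # observation above, bash closes and re-opens/seeks the script file
  # around each fork, so the counts grow with the number of commands.
  strace -f -c -e trace=open,close,read,lseek,stat bash /tmp/fake-wrapper.sh

  # Back-of-the-envelope: on the order of 100 such calls per invocation,
  # at 100 jobs/s, is ~10,000 requests/s against the shared fs server,
  # which is the figure quoted above.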
Here is our experience with running scripts from GPFS. The numbers below
show the throughput for invoking a script (a bash script that ran a
"sleep 0") from GPFS on 4, 256, and 2048 workers:

   Number of processors   Invoke-script throughput (ops/sec)
                      4   125.214
                    256   109.3272
                   2048   823.0374


> I suspect that what Allan observed, with the moving of the output files
> being slow, is a coincidence. I did a run which showed that for jobs
> toward the start of the run, the operations toward the end of the
> wrapper execution are slow, while for jobs toward the end of the run,
> the first part of the wrapper runs slower. This is likely due to
> ramp-up and ramp-down. I wanted to plot that, but the BG/P is down
> today, so it will have to wait.
>
> The solution is having things on the node-local FS. Ben already added
> some code to do that. I changed that a bit and also moved the info file
> to the scratch fs (unless the user requests that the info be on NFS in
> order to get progressive results for debugging purposes). A scratch
> directory different from the work directory is used whenever the user
> specifies <scratch>dir</scratch> in sites.xml.
>
> Another thing is using provider job status instead of status files when
> using coasters or Falkon.
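For reference, a minimal sketch of what those two settings look like in
the configuration; the property and element names below are as I recall
them and should be checked against the docs for your Swift version, and
the scratch path is only an illustration:

  # swift.properties: have the execution provider (coasters/Falkon)
  # report job status directly, instead of the wrapper writing per-job
  # status files to the shared filesystem.
  status.mode=provider

  <!-- sites.xml: node-local scratch directory, separate from the
       (networked) work directory -->
  <pool handle="bgp">
    ...
    <scratch>/tmp/swift-scratch</scratch>
  </pool>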
>
> With coasters, a scratch FS, and provider status, I empirically
> determined that an average throughput of 100 jobs/s is something the
> system (Swift + coasters) can sustain well, provided that Swift tries
> to keep the number of jobs submitted to the coaster service at about
> twice the number of workers. I tested this with 6000 workers and
> 60-second jobs. I will post the plots shortly.
>
> So here's how one would go about this on Intrepid:
> - determine the maximum number of workers (avg-exec-time * 100)
> - set the nodeGranularity to 512 nodes, with 4 workers per node. Also
> set maxWorkers to 512 so that only 512-node blocks are requested. For
> some reason 512-node partitions start almost instantly (even if you
> have 6 of them), while you have to wait for 1024-node partitions.
> - set the total number of blocks (the "slots" parameter) to
> no-of-workers/2048
> - set the jobThrottle to 2*no-of-workers/100
> - make sure you also have foreach.max.threads set to 2*no-of-workers
> (though that depends on the structure of the program)
> - run on login6. There is no point in using the normal login machines,
> since they have a limit of 1024 file descriptors per process.
>
> I will actually code an xml element for sites.xml to capture this
> without that much pain.
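In the meantime, as a concrete illustration of the recipe above: with
60-second jobs, the numbers work out to about 6000 workers (60 * 100),
3 slots (6000/2048, rounded up), and a jobThrottle of 120
(2*6000/100). A rough sites.xml sketch with those values follows; the
profile key names are taken from the list above, and the exact
spellings, the execution element, and the paths should all be checked
against your Swift/coaster version:

  <pool handle="intrepid">
    <!-- illustrative: coasters allocating nodes through Cobalt -->
    <execution provider="coaster" jobmanager="local:cobalt" url="localhost"/>
    <profile namespace="globus" key="workersPerNode">4</profile>
    <profile namespace="globus" key="nodeGranularity">512</profile>
    <profile namespace="globus" key="maxWorkers">512</profile>
    <profile namespace="globus" key="slots">3</profile>
    <!-- 120 * 100 = 12000, i.e. about twice the 6000 workers -->
    <profile namespace="karajan" key="jobThrottle">120</profile>
    <scratch>/tmp/swift-scratch</scratch>
    <!-- placeholder path; the work directory stays on the shared fs -->
    <workdirectory>/path/to/swiftwork</workdirectory>
    <filesystem provider="local"/>
  </pool>

In addition, foreach.max.threads (2*no-of-workers, so 12000 here) goes
in the Swift configuration rather than in sites.xml.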
>
> There is eventually a hard limit of (a bit less than) 65536 workers, I
> think. This is because each TCP connection from the workers requires a
> local port on the coaster service side, and there's a limit of 2^16
> ports. This could eventually be addressed by having proxies on the I/O
> nodes or something.
>   
In our experience with Falkon, the limit came much sooner than 64K. In
Falkon, using the C worker code (which runs on the BG/P), each worker
consumes 2 TCP/IP connections to the Falkon service. In the centralized
Falkon service version, this racks up connections pretty quickly. I
don't recall exactly at what point we started having issues, but it was
somewhere in the range of 10K~20K CPU cores. Essentially, we could
establish all the connections (20K~40K TCP connections), but when the
experiment actually started and data needed to flow over those
connections, all sorts of strange things happened: TCP connections would
get reset, workers would fail (e.g. their TCP connections were severed
and not re-established), etc. I want to say that 8K (maybe 16K) cores
was the largest test we ran on the BG/P with a centralized Falkon
service that was stable and successful.

However, even at these scales, the sysadmins were complaining that we
were running the login nodes too hard, and we couldn't scale much more
without taking over multiple login nodes ;) So we were motivated to go
the distributed-service route, where we ran the Falkon service on the
I/O nodes and each service only managed 256 cores. This approach adds
some overhead (especially as the I/O nodes are quite underpowered
compared to the login nodes), but it gave us a more scalable solution
that worked well from 256 cores up to 160K cores.

For the BG/P specifically, I think distributing the Falkon service to
the I/O nodes gave us a low-maintenance, robust, and scalable solution!

Ioan
> On Intrepid (as opposed to Surveyor) the default queue won't accept
> 1-node jobs, so the cleanup job at the end of a run will fail with a
> nice display of a stack trace and you will have to clean up the work
> directory manually.
>

-- 
=================================================================
Ioan Raicu, Ph.D.
NSF/CRA Computing Innovation Fellow
=================================================================
Center for Ultra-scale Computing and Information Security (CUCIS)
Department of Electrical Engineering and Computer Science
Northwestern University
2145 Sheridan Rd, Tech M384 
Evanston, IL 60208-3118
=================================================================
Cell:  1-847-722-0876
Tel:   1-847-491-8163
Email: iraicu at eecs.northwestern.edu
Web:   http://www.eecs.northwestern.edu/~iraicu/
       https://wiki.cucis.eecs.northwestern.edu/
=================================================================
=================================================================

