[Swift-devel] Running on multicore hosts

Michael Wilde wilde at mcs.anl.gov
Tue Jul 28 19:47:50 CDT 2009


Tibi,

You should be able to do some preliminary tests of your econ app on 
QueenBee using GRAM5.

The GRAM contact URIs Stu posted were:

queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs

To use all 8 cores of the hosts, turn on Swift clustering.

Then edit libexec/_swiftseq to run all the jobs in a cluster in parallel 
rather than serially.

1) add an & to the line where the jobs are exec'ed:

         "$EXEC" "${ARGS[@]}" &

2) add a wait at the end of the script:

done
wait
echo `date +%s` DONE >> $WRAPPERLOG
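
As a rough standalone illustration of the pattern in steps 1 and 2 (not the actual _swiftseq code), the sketch below backgrounds each job and then waits for all of them. The run_job function and the job count are hypothetical stand-ins for the "$EXEC" "${ARGS[@]}" line and the real cluster contents.

```shell
#!/bin/sh
# Hypothetical stand-in for one clustered job; in _swiftseq this would be
# the line that execs "$EXEC" with its arguments.
run_job() {
    sleep 1
    echo "job $1 finished"
}

start=$(date +%s)
for i in 1 2 3 4; do
    run_job "$i" &      # background each job instead of running it serially
done
wait                    # block until every background job has exited
elapsed=$(( $(date +%s) - start ))
echo "elapsed: ${elapsed}s"
```

Run serially, the four jobs would take about four seconds; with the & and the final wait they should finish in roughly the time of the slowest job.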

Then turn on clustering. You need to do the math to get a fixed cluster 
size of NCPUs: 8 for QueenBee and Abe, 16 for Ranger.

For oops we used:

clustering.enabled=true
clustering.min.time=480
clustering.queue.delay=15

with a GLOBUS::maxwalltime="00:01:00"

This gave clusters of 480/60 = 8, and PBS walltimes of 8 minutes.
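
The same arithmetic can be sketched for other core counts. This assumes the behavior observed here (cluster size = clustering.min.time / per-job walltime), not the formula in the user guide; the function name cluster_min_time is made up for illustration.

```shell
#!/bin/sh
# Assumption: Swift builds clusters of clustering.min.time / job_walltime jobs,
# as observed in the oops runs. cluster_min_time is a hypothetical helper.
cluster_min_time() {
    # $1 = cores per node (desired cluster size)
    # $2 = per-job maxwalltime in seconds
    echo $(( $1 * $2 ))
}

echo "QueenBee/Abe: clustering.min.time=$(cluster_min_time 8 60)"
echo "Ranger:       clustering.min.time=$(cluster_min_time 16 60)"
```

With a one-minute per-job walltime this gives 480 for the 8-core hosts and 960 for Ranger.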

To note:

- the site maxwalltime was ignored; Swift calculated the PBS maxwalltime 
from the cluster size it built.

- contrary to the user guide, Swift seemed to use 
clustering.min.time/(tc.data time)
rather than
(2*clustering.min.time)/(tc.data time)

That needs investigation; it may be a matter of interpretation, or the 
user guide may be describing a case where more jobs could enter the 
cluster queue before Swift has a chance to close the cluster.

- When we are more sure this works, we can commit a reference file 
_swiftpar to the libexec directory.

- at the moment the simple hack punts on per-job error code return within 
the cluster. The sequential cluster script passes on the error code of 
the first job in the cluster to fail, and aborts the rest of the 
cluster. The hack above treats the cluster as if all jobs succeeded. I'm 
not sure if the per-job error codes make it back via _swiftwrap. If not, 
they could be made to.
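
One possible way to recover an error code in the parallel version, sketched here as an assumption rather than as anything Swift currently does: record each background PID, wait on each one individually, and keep the first non-zero status. Unlike the sequential script, this does not abort the remaining jobs. The run_job function and the exit codes are hypothetical.

```shell
#!/bin/sh
# Hypothetical job that simply exits with the status it is given.
run_job() {
    return "$1"
}

pids=""
for rc in 0 3 0 5; do
    run_job "$rc" &
    pids="$pids $!"          # remember the PID of each background job
done

cluster_rc=0
for pid in $pids; do
    job_rc=0
    wait "$pid" || job_rc=$? # wait on one PID to get that job's exit status
    if [ "$job_rc" -ne 0 ] && [ "$cluster_rc" -eq 0 ]; then
        cluster_rc=$job_rc   # keep the first failure, in launch order
    fi
done
echo "cluster exit code: $cluster_rc"
```

Whether a status returned this way would reach Swift through _swiftwrap is the open question noted above.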

In any case, this is at the moment a temporary but simple hack to use 
sites with multicore nodes, while coasters is being debugged.

It could readily be generalized though into straightforward direct 
support for multicore hosts over GRAM5, PBS, or Condor-G.

- Mike







