[Swift-devel] Running on multicore hosts
Michael Wilde
wilde at mcs.anl.gov
Tue Jul 28 19:47:50 CDT 2009
Tibi,
You should be able to do some preliminary tests of your econ app on
QueenBee using GRAM5.
The GRAM contact URIs Stu posted were:
queenbee.loni-lsu.teragrid.org:2120/jobmanager-fork
queenbee.loni-lsu.teragrid.org:2120/jobmanager-pbs
To use all 8 cores of the hosts, turn on Swift clustering.
Then edit libexec/_swiftseq to run all the jobs in a cluster in parallel
rather than serially.
1) add an & to the line where the jobs are exec'ed:
"$EXEC" "${ARGS[@]}" &
2) add a wait at the end of the script:
done
wait
echo `date +%s` DONE >> $WRAPPERLOG
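For illustration, the two edits above amount to the pattern below. This is a minimal standalone sketch, not the actual contents of _swiftseq: the JOBS array and the temporary WRAPPERLOG are hypothetical stand-ins for whatever the real script reads and logs to.

```shell
#!/bin/bash
# Hypothetical stand-ins: the real _swiftseq builds EXEC and ARGS from
# its own input and logs to its own WRAPPERLOG.
WRAPPERLOG=$(mktemp)
JOBS=("sleep 1" "sleep 1" "sleep 1" "sleep 1")

START=$(date +%s)
for JOB in "${JOBS[@]}"; do
    read -r -a CMD <<< "$JOB"      # split "cmd arg ..." into words
    EXEC=${CMD[0]}
    ARGS=("${CMD[@]:1}")
    "$EXEC" "${ARGS[@]}" &         # step 1: run each job in the background
done
wait                               # step 2: block until every job has exited
ELAPSED=$(( $(date +%s) - START ))
echo "$(date +%s) DONE" >> "$WRAPPERLOG"
echo "ran ${#JOBS[@]} jobs in ${ELAPSED}s"
```

Run serially, the four one-second jobs would take about four seconds; with the & and the wait they finish in roughly the time of one job.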
Then turn on clustering. You need to do the math to get a fixed cluster
size equal to the number of CPUs per node: 8 for QueenBee and Abe, 16
for Ranger.
For oops we used:
clustering.enabled=true
clustering.min.time=480
clustering.queue.delay=15
with a GLOBUS::maxwalltime="00:01:00"
This gave clusters of 480/60 = 8 jobs, and PBS walltimes of 8 minutes.
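The arithmetic can be scripted. The sketch below assumes the behavior we observed (cluster size = clustering.min.time / per-job maxwalltime), not the 2x formula in the user guide; cluster_min_time is a hypothetical helper, not part of Swift.

```shell
#!/bin/bash
# Hypothetical helper: given cores per node and the per-job maxwalltime
# (HH:MM:SS), print the clustering.min.time that yields one job per core,
# assuming cluster size = clustering.min.time / per-job walltime.
cluster_min_time() {
    local ncpus=$1 walltime=$2
    local h m s
    IFS=: read -r h m s <<< "$walltime"
    # 10# forces base 10 so fields like "08" are not parsed as octal
    local job_secs=$(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
    echo $(( ncpus * job_secs ))
}

echo "clustering.min.time=$(cluster_min_time 8 00:01:00)"    # 8-core nodes
echo "clustering.min.time=$(cluster_min_time 16 00:01:00)"   # 16-core nodes
```

For 8 cores and a one-minute walltime this gives the 480 used above. The 16-core value is the same arithmetic and has not been tested on Ranger.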
To note:
- the site maxwalltime was ignored; Swift calculated the PBS maxwalltime
from the cluster size it built.
- contrary to the user guide, Swift seemed to use
clustering.min.time/(tc.data time)
rather than
(2*clustering.min.time)/(tc.data time)
That needs investigation; it may be a matter of interpretation, or the
user guide may be describing a case where more jobs could enter the
cluster queue before Swift has a chance to close the cluster.
- When we are more sure this works, we can commit a reference file
_swiftpar to the libexec directory.
- at the moment the simple hack punts on returning per-job error codes
from the cluster. The sequential cluster script passes on the error code
of the first job in the cluster to fail, and aborts the rest of the
cluster. The hack above treats the cluster as if all jobs succeeded. I'm
not sure if the per-job error codes make it back via _swiftwrap. If not,
they could be made to.
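One possible way to recover per-job error codes in a parallel cluster script is to remember each background PID and wait on them one at a time. This is only a sketch under assumptions: the JOBS array is a hypothetical stand-in (with "false" playing a failing job), and nothing here is taken from _swiftseq or _swiftwrap.

```shell
#!/bin/bash
# Hypothetical job list; "false" stands in for a job that fails.
JOBS=("true" "false" "true")

PIDS=()
for JOB in "${JOBS[@]}"; do
    $JOB &                 # start every job in the background
    PIDS+=("$!")           # remember its PID
done

CODES=()
FIRST_FAIL=0
for PID in "${PIDS[@]}"; do
    RC=0
    wait "$PID" || RC=$?   # wait returns that job's exit status
    CODES+=("$RC")
    if [ "$RC" -ne 0 ] && [ "$FIRST_FAIL" -eq 0 ]; then
        FIRST_FAIL=$RC     # keep the first failure, in job order
    fi
done

echo "per-job exit codes: ${CODES[*]}"
echo "cluster exit code: $FIRST_FAIL"
```

A real script could finish with exit "$FIRST_FAIL" to mimic the sequential script's error code. Unlike the sequential script, this does not abort the remaining jobs, since they are all already running.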
In any case, this is at the moment a temporary but simple hack to use
sites with multicore nodes, while coasters is being debugged.
It could readily be generalized though into straightforward direct
support for multicore hosts over GRAM5, PBS, or Condor-G.
- Mike