[ExM Users] General Swift/T questions

David Kelly davidkelly at uchicago.edu
Tue Sep 23 23:39:03 CDT 2014


Hello,

I've been thinking about the possibility of running the psims application
in Swift/T. I just have a few general questions that I didn't see answered
in the user guide.

Does Swift/T have the ability to retry failed tasks?

Does it have a resume option for failed workflows?

Is there an ability to limit the walltime of a task? We're working with a
bunch of different models, some of which will behave badly from time to
time and hang. When this happens, we'd like to end the task and retry it
(preferably on a different node).
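
To make the question concrete, here is a rough Python sketch of the kill-and-retry behavior we currently imagine doing by hand in a wrapper. It is only an illustration under my own assumptions, not anything Swift/T provides: `run_with_retry`, `walltime_s`, and `max_attempts` are hypothetical names. A wrapper like this can only retry on the same node, which is why built-in support would be preferable.

```python
import subprocess
import sys


def run_with_retry(cmd, walltime_s, max_attempts=3):
    """Run cmd, killing it if it exceeds walltime_s seconds.

    Retries on timeout or non-zero exit. Returns the 1-based number of the
    attempt that succeeded, or None if every attempt failed.
    (Hypothetical helper; all names here are illustrative.)
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the timeout elapses.
            result = subprocess.run(cmd, timeout=walltime_s)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: exceeded {walltime_s}s, killed",
                  file=sys.stderr)
            continue
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}: exit code {result.returncode}",
              file=sys.stderr)
    return None
```

Note that this only terminates the direct child process; a model that spawns its own subprocesses would need extra handling.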

Is there any way to detect node failures? For example, if a single node
repeatedly fails all of its tasks, can it be removed from the pool so that
no more tasks are sent there?

Input files will be available on a shared filesystem, but we'd like to
avoid shared disk I/O scaling problems by using the local disks whenever
possible. Does Swift/T have the concept of a scratch directory where
intermediate files can go? (Maybe this has to be done in the wrapper
script?)
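
If it does have to live in the wrapper, this is roughly what I have in mind, sketched in Python. Everything here is an assumption for illustration: `run_in_scratch`, `outputs`, `dest_dir`, and `scratch_root` are hypothetical names, and the location of the local disk on the compute nodes would have to be supplied.

```python
import os
import shutil
import subprocess
import tempfile


def run_in_scratch(cmd, outputs, dest_dir, scratch_root=None):
    """Run cmd in a fresh scratch directory and copy results back.

    cmd          -- command list, run with the scratch directory as its cwd
    outputs      -- file names (relative to the scratch dir) to keep
    dest_dir     -- final location, e.g. on the shared filesystem
    scratch_root -- node-local disk path; None uses tempfile's default
    (Hypothetical helper; all names here are illustrative.)
    """
    # The scratch directory and any intermediate files left in it are
    # deleted automatically when the with-block exits.
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        subprocess.run(cmd, cwd=scratch, check=True)
        os.makedirs(dest_dir, exist_ok=True)
        for name in outputs:
            shutil.copy2(os.path.join(scratch, name),
                         os.path.join(dest_dir, name))
```

Only the files named in `outputs` touch the shared filesystem; intermediate files stay on the local disk and are removed with the scratch directory.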

We'd be running this on Midway. Since Swift/T uses MPI, does everything
have to be launched within a single slurm job? The load on Midway varies
greatly. Swift/K allows us to submit many small slurm jobs to dynamically grow
our worker pool as nodes become available.

Thanks,
David