[Swift-user] Question about packing jobs in Cray XE6 nodes

Michael Wilde wilde at mcs.anl.gov
Mon Mar 26 14:37:07 CDT 2012


First, correction to my prior message: I meant to say "If a node is truly running out of memory..."

Second, I spoke to Justin about existing mechanisms to help you set memory requirements. Nothing implemented at the moment offers a better alternative to the "multiple sites/multiple app versions" approach I described below. 

You can hide the multiple app() function names behind a higher-level compound function that selects the right app() variant based, e.g., on calculated memory needs.

I'd start with a simple example that uses, say, one app for 24 workers per node and one app for one worker per node. Something like this (omitting many details):

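myapp()
{
  if (needsMuchMem) {
    myApp01();   // variant for the site with one worker per node
  }
  else {
    myApp24();   // variant for the site with 24 workers per node
  }
}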
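
A slightly fuller sketch of the same idea, in case it helps. Everything named here is an assumption for illustration only: the site names "packed" and "sparse", the tc app names myapp24 and myapp01, the file names, and the needsMuchMem argument. I have not run this against your setup, so treat it as a starting point rather than a working configuration.

// Assumed tc entries (hypothetical site names and path), one app name per site:
//   packed  myapp24  /path/to/myapp  INSTALLED  INTEL32::LINUX  null
//   sparse  myapp01  /path/to/myapp  INSTALLED  INTEL32::LINUX  null
// "packed" would be the pool with 24 jobs per node, "sparse" the pool with 1.

type file;

// Runs on the packed site, because only that site lists myapp24 in tc.
app (file out) myApp24 (file inp)
{
  myapp24 @filename(inp) stdout=@filename(out);
}

// Runs on the sparse site, because only that site lists myapp01 in tc.
app (file out) myApp01 (file inp)
{
  myapp01 @filename(inp) stdout=@filename(out);
}

// Compound function that hides the two variants from the rest of the script.
(file out) myapp (file inp, boolean needsMuchMem)
{
  if (needsMuchMem) {
    out = myApp01(inp);
  }
  else {
    out = myApp24(inp);
  }
}

file data <"input.dat">;
file result <"output.dat">;

// How needsMuchMem gets computed is up to you; it is hard-coded here.
result = myapp(data, true);

The rest of the script then only ever calls myapp(), so changing how jobs are split between the two pools means touching one function.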

- Mike

----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "Lorenzo Pesce" <lpesce at uchicago.edu>
> Cc: swift-user at ci.uchicago.edu
> Sent: Monday, March 26, 2012 2:08:26 PM
> Subject: Re: [Swift-user] Question about packing jobs in Cray XE6 nodes
> Hi Lorenzo,
> 
> ----- Original Message -----
> > From: "Lorenzo Pesce" <lpesce at uchicago.edu>
> > To: swift-user at ci.uchicago.edu
> > Sent: Monday, March 26, 2012 1:38:56 PM
> > Subject: [Swift-user] Question about packing jobs in Cray XE6 nodes
> > Hi all --
> > Thanks a lot for the help so far.
> >
> > Most jobs work fine, but some of them crash. Crashing appears to be
> > caused by either:
> > a) Node runs out of memory (but it seems that it affects only one job,
> > not the whole node -- however, when I send out the job alone it works
> > fine)
> 
> If a none is truly running out of memory, to the point where the Linux
> kernel "out of memory" action is triggered, the entire PBS job will be
> killed. I think that would be more visible to you (likely from PBS
> errors received by Swift).
> 
> > b) Lack of convergence (algorithm needs to be changed)
> >
> >
> > I am testing my hypothesis right now.
> >
> > Is it possible to split the pool of nodes into two groups, one where I
> > run them more packed and one where the more demanding ones are sent?
> 
> Yes; you can create multiple "pool" entries in your sites file with
> different JobsPerNode values. Then you can create multiple versions of
> your app entry or entries in your tc file, with a different app name
> (2nd field) for each site.
> 
> Then in your Swift script you need to create and call multiple app()
> function names, using the app() name to determine what site it runs
> on.
> 
> That's a bit crude, but it works. A future Swift enhancement might let
> you force an app call to run on a specific site via a settable
> parameter.
> 
> Depending on what you are trying to vary between sites, you might be
> able to do something clever by varying an environment variable within
> a single app function definition. I'll look for that info and post a
> pointer.
> 
> - Mike
> 
> 
> 
> > Thanks a lot,
> >
> > Lorenzo
> > _______________________________________________
> > Swift-user mailing list
> > Swift-user at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



