[Swift-user] Looking for the cause of failure
Mihael Hategan
hategan at mcs.anl.gov
Sun Jan 31 09:56:42 CST 2010
On Sun, 2010-01-31 at 10:49 -0500, Andriy Fedorov wrote:
> On Sat, Jan 30, 2010 at 23:45, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> >> With the previous setup, it made more sense, because the number of
> >> active jobs was <number of PBS nodes>*<number of workers per node>.
> >
> > Define "previous setup".
>
> "previous setup" is the site configuration I included in the email
> that started this thread.
>
> I just tried this "previous setup", increasing number of workers per
> node to 8, and everything worked very well (job status plot attached).
>
> > If it's about one coaster job per node, yes.
> > Unfortunately that's also something that prevents scalability with gram2
> > or clusters that have limits on the number of jobs in the queue (like
> > the BG/P).
> >
> > You can force that behavior though with maxnodes=1.
> >
> >>
> >> Am I missing something simple? Maybe I should just try the stable
> >> branch. I will do this next.
> >>
> >
> > I would advise everybody besides about 2 people doing research on I/O
> > scalability with Swift to use the stable branch. Not only does it get
> > fixes before trunk, but it doesn't get weird changes that may cause
> > random breakage.
> >
>
> With the stable branch, and "updated setup" (execution provider
> "local:pbs") I have this error message:
>
> /var/spool/torque/mom_priv/jobs/2489852.abem5.ncsa.uiuc.edu.SC: line
> 10: pdsh: command not found
>
> Should I install pdsh first?
Yes. There might be a softenv package for it.
> I didn't see it right away in the TG
> software list. I also don't see instructions in the Swift user guide,
> unless I missed it.
It's relatively new. There was also the assumption that it would be
installed pretty much everywhere, but that doesn't seem to be the case,
so I'm thinking a plain ssh solution (which is what gram does) may be
better.
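
To illustrate the fallback idea, here is a minimal sketch (not the actual
Swift/coaster code) of choosing between pdsh and plain ssh when starting
a worker on each allocated node. The function name, the host names, and
the worker command are hypothetical; the only real tools assumed are
"pdsh -w <hosts> <command>" and "ssh <host> <command>".

```python
import shutil


def build_launch_commands(hosts, worker_cmd, have_pdsh=None):
    """Return a list of argv lists that start worker_cmd on every host.

    hosts      -- node names, e.g. read from $PBS_NODEFILE (assumption)
    worker_cmd -- command to run on each node (hypothetical placeholder)
    have_pdsh  -- override detection; by default look for pdsh on PATH
    """
    if have_pdsh is None:
        have_pdsh = shutil.which("pdsh") is not None

    if have_pdsh:
        # pdsh takes a comma-separated host list via -w and fans out itself.
        return [["pdsh", "-w", ",".join(hosts), worker_cmd]]

    # Fallback: one plain ssh invocation per host.
    return [["ssh", host, worker_cmd] for host in hosts]


if __name__ == "__main__":
    for argv in build_launch_commands(["node1", "node2"], "hostname"):
        print(" ".join(argv))
```

The commands are only built here, not executed, so the detection and
fallback logic can be checked on a machine without pdsh or a cluster.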