[Swift-user] Looking for the cause of failure

Andriy Fedorov fedorov at bwh.harvard.edu
Sun Jan 31 09:49:54 CST 2010


On Sat, Jan 30, 2010 at 23:45, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>> With the previous setup, it made more sense, because the number of
>> active jobs was <number of PBS nodes>*<number of workers per node>.
>
> Define "previous setup".

"previous setup" is the site configuration I included in the email
that started this thread.

I just tried this "previous setup", increasing the number of workers per
node to 8, and everything worked very well (job status plot attached).

> If it's about one coaster job per node, yes.
> Unfortunately that's also something that prevents scalability with gram2
> or clusters that have limits on the number of jobs in the queue (like
> the BG/P).
>
> You can force that behavior though with maxnodes=1.
>
>>
>> Am I missing something simple? Maybe I should just try the stable
>> branch. I will do this next.
>>
>
> I would advise everybody besides about 2 people doing research on I/O
> scalability with Swift to use the stable branch. Not only does it get
> fixes before trunk, but it doesn't get weird changes that may cause
> random breakage.
>

With the stable branch and the "updated setup" (execution provider
"local:pbs"), I get this error message:

/var/spool/torque/mom_priv/jobs/2489852.abem5.ncsa.uiuc.edu.SC: line 10: pdsh: command not found

Should I install pdsh first? I didn't see it right away in the TG
software list. I also don't see instructions in the Swift user guide,
unless I missed it.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: karatasks.JOB_SUBMISSION-trails.png
Type: image/png
Size: 5011 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20100131/e4fb53d8/attachment.png>

