[Swift-user] Looking for the cause of failure

Andriy Fedorov fedorov at bwh.harvard.edu
Mon Feb 1 22:22:05 CST 2010


Hi Mihael,

pdsh does not seem to be available via softenv, so I compiled it from
source. I now get the following new errors:

"Worker task failed: 0201-560951-000000 Block task ended prematurely"

and

"pdsh@abe1177: abe1172: rcmd: socket: Permission denied"

Do I need to perform some special setup of pdsh? Meanwhile, my
exercises with stick lead me to the conclusion that I should use the
"gt2:gt2:pbs" jobmanager rather than "local:pbs" in the coaster
configuration...

--
Andriy Fedorov, Ph.D.

Research Fellow
Brigham and Women's Hospital
Harvard Medical School
75 Francis Street
Boston, MA 02115 USA
fedorov at bwh.harvard.edu



On Sun, Jan 31, 2010 at 10:56, Mihael Hategan <hategan at mcs.anl.gov> wrote:
> On Sun, 2010-01-31 at 10:49 -0500, Andriy Fedorov wrote:
>> On Sat, Jan 30, 2010 at 23:45, Mihael Hategan <hategan at mcs.anl.gov> wrote:
>> >> With the previous setup, it made more sense, because the number of
>> >> active jobs was <number of PBS nodes>*<number of workers per node>.
>> >
>> > Define "previous setup".
>>
>> "previous setup" is the site configuration I included in the email
>> that started this thread.
>>
>> I just tried this "previous setup", increasing the number of workers
>> per node to 8, and everything worked very well (job status plot attached).
>>
>> > If it's about one coaster job per node, yes.
>> > Unfortunately that's also something that prevents scalability with gram2
>> > or clusters that have limits on the number of jobs in the queue (like
>> > the BG/P).
>> >
>> > You can force that behavior though with maxnodes=1.
>> >
>> >>
>> >> Am I missing something simple? Maybe I should just try the stable
>> >> branch. I will do this next.
>> >>
>> >
>> > I would advise everybody besides about 2 people doing research on I/O
>> > scalability with Swift to use the stable branch. Not only does it get
>> > fixes before trunk, but it doesn't get weird changes that may cause
>> > random breakage.
>> >
>>
>> With the stable branch and the "updated setup" (execution provider
>> "local:pbs"), I get this error message:
>>
>> /var/spool/torque/mom_priv/jobs/2489852.abem5.ncsa.uiuc.edu.SC: line
>> 10: pdsh: command not found
>>
>> Should I install pdsh first?
>
> Yes. Might have a softenv package.
>
>>  I didn't see it right away in the TG
>> software list. I also don't see instructions in the Swift user guide,
>> unless I missed it.
>
> It's relatively new. There was also the assumption that it would be
> installed pretty much everywhere, but it doesn't seem to be the case, so
> I'm thinking a plain ssh solution (which is what gram does) may be
> better.
>
>


