[Swift-devel] Swift/K on Titan
Michael Wilde
wilde at mcs.anl.gov
Wed Aug 7 09:40:40 CDT 2013
Scott, David,
The problem here is that Swift is still generating a submit file for a "vanilla" PBS cluster: it tries to ssh to each node of the job to start a coaster there.
On Cray systems it needs to use aprun instead, so that part of the sites file needs to be adjusted.
You can see all the ssh errors in the *.submit.stderr file, and the .submit file shows the ssh logic itself, which should not be there at all on Cray systems.
You should check the Beagle logic and behavior, and then get the same thing working on Titan with the Titan-specific adjustments to the submit file. Maybe "mpp" should be "cray", with variants "cray-titan" and "cray-kraken".
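As a rough way to confirm this from the two files, the check described above can be sketched in Python. This is only an illustrative sketch and not part of Swift: the function names and the sample submit text are made up for illustration (the real attachments are not reproduced here), and only the ssh error line is taken from this thread.

```python
import re

# Matches the ssh failure format reported in this thread, e.g.
#   ssh: connect to host 811 port 22: Invalid argument
SSH_ERROR = re.compile(r"^ssh: connect to host (\S+) port (\d+): (.+)$")


def launchers_used(submit_text):
    """Return the set of launchers ('ssh', 'aprun') named in a submit script.

    Blank lines and lines starting with '#' (comments, #PBS directives)
    are ignored.
    """
    found = set()
    for line in submit_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        for word in re.findall(r"[A-Za-z_][\w.-]*", stripped):
            if word in ("ssh", "aprun"):
                found.add(word)
    return found


def ssh_errors(stderr_text):
    """Return (host, port, reason) tuples for each ssh connect failure."""
    errors = []
    for line in stderr_text.splitlines():
        match = SSH_ERROR.match(line.strip())
        if match:
            errors.append((match.group(1), int(match.group(2)), match.group(3)))
    return errors


# Hypothetical submit-script fragment, written only for this example.
SAMPLE_SUBMIT = """#PBS -l nodes=2
for H in $(cat $PBS_NODEFILE); do
  ssh $H hostname &
done
wait
"""

# The error line below is the one quoted in this thread.
SAMPLE_STDERR = "ssh: connect to host 811 port 22: Invalid argument\n"

if __name__ == "__main__":
    used = launchers_used(SAMPLE_SUBMIT)
    if "ssh" in used and "aprun" not in used:
        print("submit file launches via ssh; a Cray system needs aprun")
    for host, port, reason in ssh_errors(SAMPLE_STDERR):
        print("ssh to %s:%d failed: %s" % (host, port, reason))
```

To try it on a real run, read the generated .submit and .submit.stderr files into strings and pass them to the two functions in place of the sample text.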
- Mike
----- Original Message -----
> From: "Scott Krieder" <skrieder at iit.edu>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Language" <davidk at ci.uchicago.edu>
> Sent: Wednesday, August 7, 2013 9:14:03 AM
> Subject: Re: Swift/K on Titan
>
>
> Ok, I made sure there was nothing else in the queue and I ran again.
> I'm still having the same issue where the job was marked as running
> for around 2 minutes through qstat -u csep44 but swift was only
> reporting "submitted."
>
>
> I've attached the latest log, submit and stderr from that run.
>
>
> -Scott
>
>
>
> On Wed, Aug 7, 2013 at 8:56 AM, Michael Wilde < wilde at mcs.anl.gov >
> wrote:
>
>
> Scott, David,
>
> I think this message in the log explains the root cause:
>
> 2013-08-06 21:28:16,528-0400 DEBUG AbstractExecutor Output from qsub
> is: " Job not submitted You currently have a job in the debug queue.
> Each user is allowed to have only one job at a time in the debug
> queue. Please wait until job 1695274 completes before submitting
> another job to the debug queue. "
>
> The "bug" here is simply that Swift doesn't send these messages back
> to the user in a clear manner.
>
> Can you remove the offending job (if it's still queued) and try again?
>
> Thanks,
>
> - Mike
>
>
>
> ----- Original Message -----
> > From: "Scott Krieder" < skrieder at iit.edu >
> > To: "Michael Wilde" < wilde at mcs.anl.gov >
> > Cc: "Swift Language" < davidk at ci.uchicago.edu >
> > Sent: Tuesday, August 6, 2013 11:42:37 PM
> > Subject: Re: Swift/K on Titan
> >
> >
> > Hi Mike,
> >
> >
> > Here is the *.log from the modis01 run.
> >
> >
> > -Scott
> >
> >
> >
> > On Tue, Aug 6, 2013 at 11:30 PM, Michael Wilde < wilde at mcs.anl.gov > wrote:
> >
> >
> > Scott, can you also send the large *.log file?
> >
> >
> >
> > ----- Original Message -----
> > > From: "Scott Krieder" < skrieder at iit.edu >
> > > To: "Michael Wilde" < wilde at mcs.anl.gov >
> > > Cc: "Swift Language" < davidk at ci.uchicago.edu >
> > > Sent: Tuesday, August 6, 2013 10:28:40 PM
> > > Subject: Swift/K on Titan
> > >
> > >
> > > Hi Mike,
> > >
> > >
> > > > Just wanted to keep you in the loop. David and I got a lot closer
> > > > to getting Swift/K running on Titan.
> > >
> > >
> > > > We ended up leaving off with the following error:
> > > > The Swift job would join the queue and show as running under
> > > > qstat, but would only show as submitted through the swift report.
> > >
> > >
> > > The submit script was also writing this to stderr:
> > > ssh: connect to host 811 port 22: Invalid argument
> > >
> > >
> > >
> > > > I've attached the submit script that swift generated as well as
> > > > the stderr that was generated.
> > >
> > >
> > > -Scott
> > >
> > >
> > > --
> > > Scott J. Krieder
> > >
> > >
> > > C: 419-685-0410
> > >
> > > E: skrieder at iit.edu
> > >
> > > http://datasys.cs.iit.edu/~skrieder/
> >
> >
> >
> >
>
>
>
>
> --
> Scott J. Krieder
>
>
> C: 419-685-0410
>
> E: skrieder at iit.edu
>
> http://datasys.cs.iit.edu/~skrieder/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: modis01-20130807-1005-u53lntkc.log
Type: text/x-log
Size: 17886 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20130807/1361b216/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PBS1071539250834089601.submit
Type: application/octet-stream
Size: 1312 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20130807/1361b216/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PBS1071539250834089601.submit.stderr
Type: application/octet-stream
Size: 863 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20130807/1361b216/attachment-0001.obj>