[Swift-devel] swift pbs/beagle broken
Ketan Maheshwari
ketancmaheshwari at gmail.com
Sun Nov 13 20:05:15 CST 2011
This fix works for me. I tested with one catsn job on Beagle.
On Sun, Nov 13, 2011 at 7:48 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> OK, here is a simple fix for this problem. Apply the one-line patch
> below, which adds the variable "SWIFT_USERHOME" to the swift command
> script; then do:
>
> export SWIFT_USERHOME=/lustre/beagle/wilde
> swift etc
>
> This makes swift use $SWIFT_USERHOME instead of $HOME to locate the
> .globus directory.
>
> This will of course mess up if a swift run needs to locate your
> certificates; possibly you can get around that with a symlink. But I
> suspect most uses of this will be for local execution on systems like
> Beagle with non-writeable home dirs.
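
A minimal sketch of the symlink workaround mentioned above, assuming the certificates live under $HOME/.globus on the login node. The function name link_globus_files and the list of file names are illustrative assumptions, not part of Swift:

```shell
# Hypothetical helper: expose credential files from the real home's .globus
# directory inside an alternate home (for example $SWIFT_USERHOME).
# Usage: link_globus_files REAL_HOME ALT_HOME
link_globus_files() {
    real_home=$1
    alt_home=$2
    mkdir -p "$alt_home/.globus" || return 1
    # File names are assumptions; adjust to whatever your setup uses.
    for name in usercert.pem userkey.pem certificates; do
        src="$real_home/.globus/$name"
        dst="$alt_home/.globus/$name"
        # Link only what exists and never overwrite an existing entry.
        if [ -e "$src" ] && [ ! -e "$dst" ]; then
            ln -s "$src" "$dst"
        fi
    done
}
```

On a login node this might be called as link_globus_files "$HOME" "$SWIFT_USERHOME". The links point back into the real home, so they would only resolve on hosts where that directory is actually mounted.
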
>
> Here's the 1-line fix:
>
> login$ pwd
> /home/wilde/swift/src/0.93/cog/modules/swift/bin
> login$ svn diff
> Index: swift
> ===================================================================
> --- swift (revision 5284)
> +++ swift (working copy)
> @@ -86,6 +86,7 @@
> updateOptions "$X509_USER_PROXY" "X509_USER_PROXY"
> updateOptions "$SWIFT_HOME" "COG_INSTALL_PATH"
> updateOptions "$SWIFT_HOME" "swift.home"
> +updateOptions "$SWIFT_USERHOME" "user.home"
> #Use /dev/urandom instead of /dev/random for seeding RNGs
> #This will lower the randomness of the seed, but avoid
> #large delays if /dev/random does not have enough entropy collected
> login$
>
> If others can confirm that this works, I'll check it in.
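
For readers without the script at hand, here is a sketch of how a helper like updateOptions could behave. The body is an assumption inferred from how the patch calls it (value first, Java property name second); the real function in the swift script may differ:

```shell
OPTIONS=""

# Assumed behavior: add -D<property>=<value> to the JVM options,
# but only when the value is non-empty.
updateOptions() {
    if [ -n "$1" ]; then
        OPTIONS="$OPTIONS -D$2=$1"
    fi
}

SWIFT_USERHOME=/lustre/beagle/wilde
updateOptions "$SWIFT_USERHOME" "user.home"
echo "JVM options:$OPTIONS"
```

Under this assumption, leaving SWIFT_USERHOME unset adds nothing, so Java keeps its default user.home and existing setups would be unaffected.
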
>
> - Mike
>
>
>
> ----- Original Message -----
> > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > Sent: Sunday, November 13, 2011 3:08:36 PM
> > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > OK, as some of you can see in the message I just sent to beagle-support:
> > it now looks to me like the root problem of the swift jobs failing is
> > that our home dirs are not being seen on the compute nodes, hence the
> > swift-generated PBS script to launch the coaster workers can't find the
> > worker.pl script that swift copied to $HOME/.globus/coasters.
> >
> > This is what I see:
> >
> > The following was run under qsub -I; the line "total 0" shows that
> > /home/wilde was empty as seen by the compute node.
> >
> > login1$ aprun /bin/sh -c 'hostname; ls -l /home/wilde/; mount | grep home; '
> > nid00466
> > total 0
> > /autonfs/home on /autonfs/home type dvs (ro,blksize=16384,nodename=c1-0c0s7n3:c4-0c0s2n0:c4-0c0s2n1:c4-0c0s2n2:c4-0c0s2n3,attrcache_timeout=14400,cache,nodatasync,noclosesync,retry,failover,userenv,clusterfs,killprocess,nobulk_rw,noatomic,nodeferopens,loadbalance,maxnodes=1,nnodes=5)
> > /autonfs/home on /autonfs/home type dvs (ro,blksize=16384,nodename=c1-0c0s7n3:c4-0c0s2n0:c4-0c0s2n1:c4-0c0s2n2:c4-0c0s2n3,attrcache_timeout=14400,cache,nodatasync,noclosesync,retry,failover,userenv,clusterfs,killprocess,nobulk_rw,noatomic,nodeferopens,loadbalance,maxnodes=1,nnodes=5)
> > Application 863284 resources: utime ~0s, stime ~0s
> > login1$
> >
> > Can anyone verify that they are seeing the same symptom?
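
One way to make that check repeatable is a small script saved on a filesystem the compute nodes can read. This is only a sketch; the function name home_status is made up for illustration:

```shell
# Classify a directory as seen from wherever this script runs.
# Prints one of: missing, empty, populated.
home_status() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing"
    elif [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
        # An unreadable directory also lands here, since ls prints nothing.
        echo "empty"
    else
        echo "populated"
    fi
}

echo "$(hostname): $HOME is $(home_status "$HOME")"
```

Inside a qsub -I session it could be launched with aprun in the same way as the ls command above. An "empty" result on a compute node next to "populated" on the login node would match the symptom described here.
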
> >
> > Thanks,
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> > > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > Sent: Sunday, November 13, 2011 2:41:36 PM
> > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > I tracked the message below down to the fact that aprun doesn't like
> > > "&" in its command string. I vaguely recall reporting something
> > > similar to Cray way back and they agreed it's a bug.
> > >
> > > But it seems that the *original* Swift command string did not have a
> > > "&" in it, so I'm back to square one.
> > >
> > > - Mike
> > >
> > > ----- Original Message -----
> > > > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > > Sent: Sunday, November 13, 2011 1:52:58 PM
> > > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > > It's starting to look like some kind of aprun-based failure. I see
> > > > this from more detailed logging I put into the generated script:
> > > >
> > > > IN .submit script
> > > > aprun: Unexpected close of the apsys control connection
> > > > aprun: Exiting due to errors. Application aborted
> > > > aprun rc 1
> > > >
> > > > I was led off track by the fact that the exitcode file is missing.
> > > > It seems that it's generated but then removed before we can see it. I
> > > > suspect one part of the provider thinks the worker-launch job
> > > > succeeded, and hence removes the exitcode file, but another part
> > > > realizes that the job failed. (conjecture...)
> > > >
> > > > Now that that part is partially explained, I think I can go back to
> > > > debugging this from manual qsubs, which should go faster.
> > > >
> > > > I'm still unsure if the missing stdout/err files are due to a Beagle
> > > > issue; it's starting to look more like they may be due to the weird
> > > > way in which the aprun dies.
> > > >
> > > > Digging deeper...
> > > >
> > > > - Mike
> > > >
> > > > ----- Original Message -----
> > > > > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > > > Sent: Sunday, November 13, 2011 7:51:57 AM
> > > > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > > > I've backed up and just ran a test from swift (automatic).
> > > > >
> > > > > I see that in that case I am *not* getting an exitcode file.
> > > > > Are you getting one?
> > > > >
> > > > > - Mike
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > > To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > > > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > > > > Sent: Sunday, November 13, 2011 7:45:05 AM
> > > > > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > > > > But if you put an explicit output redirection in the /bin/sh -c
> > > > > > command, you will see that those commands are indeed executing and
> > > > > > generating output.
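
A sketch of what such an explicit redirection could look like. LAUNCHER and OUT are placeholder variables introduced here for illustration; on Beagle the launcher would be something like "aprun -B" and OUT a file on a filesystem the compute nodes can write to:

```shell
# LAUNCHER is empty for a local dry run; on Beagle it would be "aprun -B".
LAUNCHER=${LAUNCHER:-}
# OUT is a hypothetical location; on Beagle pick a directory under /lustre.
OUT=${OUT:-${TMPDIR:-/tmp}/redirect-test.$$.out}

# The redirection sits inside the quoted command string, so it is performed
# by the /bin/sh that the launcher starts, not by the shell running this
# script and not by the qsub -o / -e handling.
$LAUNCHER /bin/sh -c "date > '$OUT' 2>&1"
echo "exit code: $?"
```

If OUT ends up with content while the files named by the #PBS -o and -e lines stay empty, that would suggest the command itself runs and the problem lies in how the job's stdout and stderr are delivered.
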
> > > > > >
> > > > > > So like I mentioned earlier, I don't know if the qsub -o and -e flags
> > > > > > have changed behavior (e.g. they now can't write to /home???), or if
> > > > > > we are using them incorrectly.
> > > > > >
> > > > > > But I think we need to go backwards and see why this is not working
> > > > > > with the swift-generated qsub files.
> > > > > >
> > > > > > We should next add the two tags to the sites file to obtain a log from
> > > > > > the worker, on the (untested!) assumption that the worker is really
> > > > > > starting in the automatic swift case:
> > > > > >
> > > > > > <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile>
> > > > > > <profile namespace="globus" key="workerLoggingDirectory">/lustre/beagle/wilde/beagle</profile>
> > > > > >
> > > > > > - Mike
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> > > > > > > To: "Michael Wilde" <wilde at mcs.anl.gov>
> > > > > > > Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> > > > > > > Sent: Sunday, November 13, 2011 7:35:24 AM
> > > > > > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > > > > > On Sun, Nov 13, 2011 at 9:28 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> > > > > > >
> > > > > > >
> > > > > > > 2 thoughts here, Ketan:
> > > > > > >
> > > > > > > - when I tried my manual coaster test, I replaced the options
> > > > > > > "-n 3 -N 1 -cc none -d 24 -F exclusive" on aprun with simply "-B",
> > > > > > > which says "use the options from qsub". I was going to go back and
> > > > > > > see if there was some subtle new mismatch between these qsub and
> > > > > > > aprun processor-layout options.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I tried the -B option:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > #CoG This script generated by CoG
> > > > > > > #CoG by class: class org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor
> > > > > > > #CoG on date: 2011/11/13 02:16:54
> > > > > > >
> > > > > > >
> > > > > > > #PBS -S /bin/bash
> > > > > > > #PBS -N Block-1113-1602
> > > > > > > #PBS -m n
> > > > > > > #PBS -A CI-DEB000002
> > > > > > > #PBS -l mppwidth=3,mppnppn=1,mppdepth=24
> > > > > > > #PBS -l walltime=00:10:00
> > > > > > > #PBS -o /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stdout
> > > > > > > #PBS -e /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stderr
> > > > > > > WORKER_LOGGING_LEVEL=NONE
> > > > > > > #PBS -v WORKER_LOGGING_LEVEL
> > > > > > > cd / && aprun -B /bin/sh -c /bin/date
> > > > > > > /bin/echo $? >/home/ketan/.globus/scripts/PBS2583661693904024220.submit.exitcode
> > > > > > >
> > > > > > >
> > > > > > > And I see the same behavior. The exitcode file is indeed updated
> > > > > > > each time with a code 0.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > - I realized that manually testing the swift-generated submit file
> > > > > > > will give new errors because the swift service is no longer alive
> > > > > > > and listening on the port that the worker will try to connect to.
> > > > > > > Also, it seemed that the .pl file itself that automatic coaster
> > > > > > > bootstrap places in ~/.globus/coasters was not there. I'm assuming
> > > > > > > that Swift removes these files when it exits, but need to verify
> > > > > > > that this is true and that the failure is not due to a missing .pl
> > > > > > > file. I suspect that this is normal and is not the problem, but
> > > > > > > again, we need to keep debugging until the root cause is found.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Mike
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > >
> > > > > > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > > > > > > To: "Michael Wilde" < wilde at mcs.anl.gov >
> > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Sent: Sunday, November 13, 2011 7:20:25 AM
> > > > > > > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > > > > > > I tried with a simple /bin/date command at the end of the submit
> > > > > > > > script, removing the call to worker.pl:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > #CoG This script generated by CoG
> > > > > > > > #CoG by class: class org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor
> > > > > > > > #CoG on date: 2011/11/13 02:16:54
> > > > > > > >
> > > > > > > >
> > > > > > > > #PBS -S /bin/bash
> > > > > > > > #PBS -N Block-1113-1602
> > > > > > > > #PBS -m n
> > > > > > > > #PBS -A CI-DEB000002
> > > > > > > > #PBS -l mppwidth=3,mppnppn=1,mppdepth=24
> > > > > > > > #PBS -l walltime=00:10:00
> > > > > > > > #PBS -o /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stdout
> > > > > > > > #PBS -e /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stderr
> > > > > > > > WORKER_LOGGING_LEVEL=NONE
> > > > > > > > #PBS -v WORKER_LOGGING_LEVEL
> > > > > > > > cd / && aprun -n 3 -N 1 -cc none -d 24 -F exclusive /bin/sh -c /bin/date
> > > > > > > >
> > > > > > > >
> > > > > > > > =======
> > > > > > > >
> > > > > > > >
> > > > > > > > This fails too. The queue cancels the job as soon as it starts
> > > > > > > > running, without writing anything to stdout or stderr.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sun, Nov 13, 2011 at 12:54 AM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > OK, I don't need these; I can reproduce the problem as well.
> > > > > > > >
> > > > > > > > For some reason, the coaster worker is exiting immediately.
> > > > > > > >
> > > > > > > > I see a few possibilities:
> > > > > > > >
> > > > > > > > - Beagle networking may have changed, making it no longer possible
> > > > > > > > to reach the coaster service from the compute nodes using the
> > > > > > > > previous IP address ranges.
> > > > > > > >
> > > > > > > > - the worker.pl script is not being created in $HOME/.globus/coasters
> > > > > > > >
> > > > > > > > Mike
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Michael Wilde" < wilde at mcs.anl.gov >
> > > > > > > > > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > > > > > > > Sent: Saturday, November 12, 2011 8:39:36 PM
> > > > > > > > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > > > > > > > Ketan, can you post the submit script and site file?
> > > > > > > > >
> > > > > > > > > On 11/12/11, Ketan Maheshwari <ketancmaheshwari at gmail.com> wrote:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > It seems the pbs-coaster provider (local:pbs) is broken for swift.
> > > > > > > > > > I tried swift trunk, 0.93 svn branch, 0.93RC3 and 0.93RC4 but I am
> > > > > > > > > > getting the same response:
> > > > > > > > > >
> > > > > > > > > > Swift svn swift-r5205 cog-r3293
> > > > > > > > > >
> > > > > > > > > > RunID: 20111113-0216-1d35h7eb
> > > > > > > > > > Progress: time: Sun, 13 Nov 2011 02:16:54 +0000
> > > > > > > > > > site setting workersPerNode has been replaced with jobsPerNode!
> > > > > > > > > > Progress: time: Sun, 13 Nov 2011 02:17:05 +0000 Active:1
> > > > > > > > > > Failed to transfer wrapper log for job cat-1hg8aoik
> > > > > > > > > > Exception in cat:
> > > > > > > > > > Arguments: [data.txt]
> > > > > > > > > > Host: pbs
> > > > > > > > > > Directory: catsn-20111113-0216-1d35h7eb/jobs/1/cat-1hg8aoik
> > > > > > > > > > stderr.txt:
> > > > > > > > > >
> > > > > > > > > > stdout.txt:
> > > > > > > > > >
> > > > > > > > > > ----
> > > > > > > > > >
> > > > > > > > > > Caused by: Task failed: 1113-160254-000000 Block task ended prematurely
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Final status: time: Sun, 13 Nov 2011 02:17:05 +0000 Failed:1
> > > > > > > > > > The following errors have occurred:
> > > > > > > > > > 1. Task failed: 1113-160254-000000 Block task ended prematurely
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Trying the submit script outside of swift also does not seem to be
> > > > > > > > > > working. The script gets submitted to the queue and immediately
> > > > > > > > > > exits without writing anything to stdout or stderr.
> > > > > > > > > >
> > > > > > > > > > Were there any recent changes that could have affected this?
> > > > > > > > > >
> > > > > > > > > > I remember having tried this successfully in the last week of last
> > > > > > > > > > month.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > --
> > > > > > > > > > Ketan
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Sent from my mobile device
> > > > > > > > > _______________________________________________
> > > > > > > > > Swift-devel mailing list
> > > > > > > > > Swift-devel at ci.uchicago.edu
> > > > > > > > >
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > > > > >
> > > > > > > > --
> > > > > > > > Michael Wilde
> > > > > > > > Computation Institute, University of Chicago
> > > > > > > > Mathematics and Computer Science Division
> > > > > > > > Argonne National Laboratory
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Ketan
> > > > > > >
> > > > > > > --
> > > > > > > Michael Wilde
> > > > > > > Computation Institute, University of Chicago
> > > > > > > Mathematics and Computer Science Division
> > > > > > > Argonne National Laboratory
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Ketan
> > > > > >
> > > > > > --
> > > > > > Michael Wilde
> > > > > > Computation Institute, University of Chicago
> > > > > > Mathematics and Computer Science Division
> > > > > > Argonne National Laboratory
> > > > > >
> > > > >
> > > > > --
> > > > > Michael Wilde
> > > > > Computation Institute, University of Chicago
> > > > > Mathematics and Computer Science Division
> > > > > Argonne National Laboratory
> > > > >
> > > >
> > > > --
> > > > Michael Wilde
> > > > Computation Institute, University of Chicago
> > > > Mathematics and Computer Science Division
> > > > Argonne National Laboratory
> > > >
> > >
> > > --
> > > Michael Wilde
> > > Computation Institute, University of Chicago
> > > Mathematics and Computer Science Division
> > > Argonne National Laboratory
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
>
--
Ketan