[Swift-devel] swift pbs/beagle broken

Michael Wilde wilde at mcs.anl.gov
Sun Nov 13 20:34:51 CST 2011


OK. I tested with a 1-job run and a 1000 job run.

I committed this "feature" to 0.93 as Swift rev 5285. I'm assuming it's safe because no one is likely to set SWIFT_USERHOME unless we tell them to.
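
For reference, here is a minimal sketch of why an unset SWIFT_USERHOME should leave existing runs unchanged. It assumes that updateOptions in the swift launcher only adds a -D property when its first argument is non-empty. The function body below illustrates that assumed behavior and is not copied from the script.

```shell
#!/bin/sh
# Sketch only: assumes updateOptions skips empty values (an assumption,
# not copied from the real launcher script).
OPTIONS=""

updateOptions() {
    # $1 = value, $2 = Java system property name
    if [ -n "$1" ]; then
        OPTIONS="-D$2=$1 $OPTIONS"
    fi
}

# Unset: no user.home override is added, so Java keeps its default.
unset SWIFT_USERHOME
updateOptions "${SWIFT_USERHOME:-}" "user.home"
echo "unset: [$OPTIONS]"

# Set: user.home is overridden, which moves the .globus lookup.
SWIFT_USERHOME=/lustre/beagle/wilde
updateOptions "${SWIFT_USERHOME:-}" "user.home"
echo "set:   [$OPTIONS]"
```

If that assumption holds, the new line is a no-op for anyone who has not exported the variable.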

David, can you build and post a new RC?
(Do you know how to mark the release as 0.93RC5 per the method Justin described in our last meeting, so that it shows up in the Swift log under that release name?)

Ketan, can you see if this now gets Fangfang rolling?

Thanks, all.

- Mike



----- Original Message -----
> From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Sunday, November 13, 2011 6:05:15 PM
> Subject: Re: [Swift-devel] swift pbs/beagle broken
> This fix works for me. I tested with one catsn job on Beagle.
> 
> 
> On Sun, Nov 13, 2011 at 7:48 PM, Michael Wilde < wilde at mcs.anl.gov >
> wrote:
> 
> 
> OK, here is a simple fix for this problem. Just set the environment
> variable "SWIFT_USERHOME" before running your swift command:
> 
> export SWIFT_USERHOME=/lustre/beagle/wilde
> swift etc
> 
> This makes swift use $SWIFT_USERHOME instead of $HOME to locate the
> .globus directory.
> 
> This will of course mess up if a swift run needs to locate your
> certificates; possibly you can get around that with a symlink. But I
> suspect most uses of this will be for local execution on systems like
> Beagle with non-writeable home dirs.
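
A rough sketch of the symlink workaround mentioned above, for runs that do need certificates. It assumes the credentials live in the usual Globus places under the real home directory (usercert.pem, userkey.pem and a certificates directory in .globus). The function name link_globus_certs and its arguments are hypothetical, and this has not been tried on Beagle.

```shell
#!/bin/sh
# Sketch only: link_globus_certs and its arguments are hypothetical names.
# Assumes credentials are usercert.pem, userkey.pem and a certificates
# directory under <real home>/.globus.
link_globus_certs() {
    real_home=$1
    alt_home=$2
    mkdir -p "$alt_home/.globus" || return 1
    for name in usercert.pem userkey.pem certificates; do
        src="$real_home/.globus/$name"
        dst="$alt_home/.globus/$name"
        # Link only what exists and is not already present at the target.
        if [ -e "$src" ] && [ ! -e "$dst" ] && [ ! -L "$dst" ]; then
            ln -s "$src" "$dst" || return 1
        fi
    done
}

# Example (paths are placeholders):
#   link_globus_certs "$HOME" /lustre/beagle/wilde
#   export SWIFT_USERHOME=/lustre/beagle/wilde
```

The links only resolve where the real home directory is mounted, so this would help on the login node but not on the compute nodes.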
> 
> Here's the 1-line fix:
> 
> login$ pwd
> /home/wilde/swift/src/0.93/cog/modules/swift/bin
> login$ svn diff
> Index: swift
> ===================================================================
> --- swift (revision 5284)
> +++ swift (working copy)
> @@ -86,6 +86,7 @@
> updateOptions "$X509_USER_PROXY" "X509_USER_PROXY"
> updateOptions "$SWIFT_HOME" "COG_INSTALL_PATH"
> updateOptions "$SWIFT_HOME" "swift.home"
> +updateOptions "$SWIFT_USERHOME" "user.home"
> #Use /dev/urandom instead of /dev/random for seeding RNGs
> #This will lower the randomness of the seed, but avoid
> #large delays if /dev/random does not have enough entropy collected
> login$
> 
> If others can confirm that this works, I'll check it in.
> 
> 
> - Mike
> 
> 
> 
> ----- Original Message -----
> > From: "Michael Wilde" < wilde at mcs.anl.gov >
> > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > Sent: Sunday, November 13, 2011 3:08:36 PM
> > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > OK, as some of you can see in the message I just sent to
> > beagle-support, it now looks to me like the root problem of the swift
> > jobs failing is that our home dirs are not being seen on the compute
> > nodes; hence the swift-generated PBS script that launches the coaster
> > workers can't find the worker.pl script that swift copied to
> > $HOME/.globus/coasters.
> >
> > This is what I see:
> >
> > The following was run under qsub -I; the line "total 0" shows that
> > /home/wilde was empty as seen by the compute node.
> >
> > login1$ aprun /bin/sh -c 'hostname; ls -l /home/wilde/; mount | grep home; '
> > nid00466
> > total 0
> > /autonfs/home on /autonfs/home type dvs
> > (ro,blksize=16384,nodename=c1-0c0s7n3:c4-0c0s2n0:c4-0c0s2n1:c4-0c0s2n2:c4-0c0s2n3,attrcache_timeout=14400,cache,nodatasync,noclosesync,retry,failover,userenv,clusterfs,killprocess,nobulk_rw,noatomic,nodeferopens,loadbalance,maxnodes=1,nnodes=5)
> > /autonfs/home on /autonfs/home type dvs
> > (ro,blksize=16384,nodename=c1-0c0s7n3:c4-0c0s2n0:c4-0c0s2n1:c4-0c0s2n2:c4-0c0s2n3,attrcache_timeout=14400,cache,nodatasync,noclosesync,retry,failover,userenv,clusterfs,killprocess,nobulk_rw,noatomic,nodeferopens,loadbalance,maxnodes=1,nnodes=5)
> > Application 863284 resources: utime ~0s, stime ~0s
> > login1$
> >
> > Can anyone verify that they are seeing the same symptom?
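
One way to repeat this check for your own account is sketched below. The helper name dir_has_entries is hypothetical, and the aprun line in the comments assumes $HOME is set in the environment that aprun passes to the compute node.

```shell
#!/bin/sh
# Sketch only: dir_has_entries is a hypothetical helper name.
dir_has_entries() {
    # Succeeds if $1 is a directory containing at least one entry
    # (including dotfiles such as .globus).
    [ -d "$1" ] || return 1
    [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# On a login node this checks the local view:
#   dir_has_entries "$HOME" && echo visible || echo "empty or missing"
# Under qsub -I, the same test can be run on a compute node via aprun:
#   aprun /bin/sh -c 'hostname; [ -n "$(ls -A "$HOME")" ] && echo visible || echo empty'
```

A directory that reports "visible" on the login node but "empty" under aprun would match the symptom shown above.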
> >
> > Thanks,
> >
> > - Mike
> >
> >
> > ----- Original Message -----
> > > From: "Michael Wilde" < wilde at mcs.anl.gov >
> > > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > Sent: Sunday, November 13, 2011 2:41:36 PM
> > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > I tracked the message below down to the fact that aprun doesn't like
> > > "&" in its command string. I vaguely recall reporting something
> > > similar to Cray way back and they agreed it's a bug.
> > >
> > > But it seems that the *original* Swift command string did not have a
> > > "&" in it, so I'm back to square one.
> > >
> > > - Mike
> > >
> > > ----- Original Message -----
> > > > From: "Michael Wilde" < wilde at mcs.anl.gov >
> > > > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > > Sent: Sunday, November 13, 2011 1:52:58 PM
> > > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > > It's starting to look like some kind of aprun-based failure. I see
> > > > this from more detailed logging I put into the generated script:
> > > >
> > > > IN .submit script
> > > > aprun: Unexpected close of the apsys control connection
> > > > aprun: Exiting due to errors. Application aborted
> > > > aprun rc 1
> > > >
> > > > I was led off track by the fact that the exitcode file is missing.
> > > > It seems that it's generated but then removed before we can see it. I
> > > > suspect one part of the provider thinks the worker-launch job
> > > > succeeded, and hence removes the exitcode file, but another part
> > > > realizes that the job failed. (conjecture...)
> > > >
> > > > Now that that part is partially explained, I think I can go back to
> > > > debugging this from manual qsubs, which should go faster.
> > > >
> > > > I'm still unsure if the missing stdout/err files are due to a Beagle
> > > > issue; it's starting to look more like they're due to the weird way
> > > > in which the aprun dies.
> > > >
> > > > Digging deeper...
> > > >
> > > > - Mike
> > > >
> > > > ----- Original Message -----
> > > > > From: "Michael Wilde" < wilde at mcs.anl.gov >
> > > > > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > > > Sent: Sunday, November 13, 2011 7:51:57 AM
> > > > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > > > I've backed up and just did a test from swift (automatic).
> > > > >
> > > > > I see that in that case I am *not* getting an exitcode file.
> > > > > Are you getting one?
> > > > >
> > > > > - Mike
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Michael Wilde" < wilde at mcs.anl.gov >
> > > > > > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > > > > Sent: Sunday, November 13, 2011 7:45:05 AM
> > > > > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > > > > But if you put an explicit output redirection in the /bin/sh -c
> > > > > > command, you will see that those commands are indeed executing and
> > > > > > generating output.
> > > > > >
> > > > > > So like I mentioned earlier, I don't know if the qsub -o and -e flags
> > > > > > have changed behavior (e.g. they now can't write to /home???), or if
> > > > > > we are using them incorrectly.
> > > > > >
> > > > > > But I think we need to go backwards and see why this is not working
> > > > > > with the swift-generated qsub files.
> > > > > >
> > > > > > We should next add the two tags to the sites file to obtain a log from
> > > > > > the worker, on the (untested!) assumption that the worker is really
> > > > > > starting in the automatic swift case:
> > > > > >
> > > > > > <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile>
> > > > > > <profile namespace="globus" key="workerLoggingDirectory">/lustre/beagle/wilde/beagle</profile>
> > > > > >
> > > > > > - Mike
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > > > > > To: "Michael Wilde" < wilde at mcs.anl.gov >
> > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > > > > > Sent: Sunday, November 13, 2011 7:35:24 AM
> > > > > > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > > > > > On Sun, Nov 13, 2011 at 9:28 AM, Michael Wilde < wilde at mcs.anl.gov > wrote:
> > > > > > >
> > > > > > >
> > > > > > > 2 thoughts here, Ketan:
> > > > > > >
> > > > > > > - when I tried my manual coaster test, I replaced the options
> > > > > > > "-n 3 -N 1 -cc none -d 24 -F exclusive" on aprun with simply "-B",
> > > > > > > which says "use the options from qsub". I was going to go back and
> > > > > > > see if there was some subtle new mismatch between these qsub and
> > > > > > > aprun processor-layout options.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I tried the -B option:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > #CoG This script generated by CoG
> > > > > > > #CoG by class: class org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor
> > > > > > > #CoG on date: 2011/11/13 02:16:54
> > > > > > >
> > > > > > >
> > > > > > > #PBS -S /bin/bash
> > > > > > > #PBS -N Block-1113-1602
> > > > > > > #PBS -m n
> > > > > > > #PBS -A CI-DEB000002
> > > > > > > #PBS -l mppwidth=3,mppnppn=1,mppdepth=24
> > > > > > > #PBS -l walltime=00:10:00
> > > > > > > #PBS -o /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stdout
> > > > > > > #PBS -e /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stderr
> > > > > > > WORKER_LOGGING_LEVEL=NONE
> > > > > > > #PBS -v WORKER_LOGGING_LEVEL
> > > > > > > cd / && aprun -B /bin/sh -c /bin/date
> > > > > > > /bin/echo $? >/home/ketan/.globus/scripts/PBS2583661693904024220.submit.exitcode
> > > > > > >
> > > > > > >
> > > > > > > And I see the same behavior. The exitcode file is indeed updated each
> > > > > > > time with a code 0.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > - I realized that manually testing the swift-generated submit file
> > > > > > > will give new errors because the swift service is no longer alive and
> > > > > > > listening on the port that the worker will try to connect to. Also, it
> > > > > > > seemed that the .pl file itself that automatic coaster bootstrap
> > > > > > > places in ~/.globus/coasters was not there. I'm assuming that Swift
> > > > > > > removes these files when it exits, but need to verify that this is
> > > > > > > true and that the failure is not due to a missing .pl file. I suspect
> > > > > > > that this is normal and is not the problem, but again, we need to keep
> > > > > > > debugging until the root cause is found.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Mike
> > > > > > >
> > > > > > >
> > > > > > > ----- Original Message -----
> > > > > > >
> > > > > > > > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > > > > > > To: "Michael Wilde" < wilde at mcs.anl.gov >
> > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > > > > > > Sent: Sunday, November 13, 2011 7:20:25 AM
> > > > > > > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > > > > > > I tried with a simple /bin/date command at the end of the submit
> > > > > > > > script, removing the call to worker.pl:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > #CoG This script generated by CoG
> > > > > > > > #CoG by class: class org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor
> > > > > > > > #CoG on date: 2011/11/13 02:16:54
> > > > > > > >
> > > > > > > >
> > > > > > > > #PBS -S /bin/bash
> > > > > > > > #PBS -N Block-1113-1602
> > > > > > > > #PBS -m n
> > > > > > > > #PBS -A CI-DEB000002
> > > > > > > > #PBS -l mppwidth=3,mppnppn=1,mppdepth=24
> > > > > > > > #PBS -l walltime=00:10:00
> > > > > > > > #PBS -o /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stdout
> > > > > > > > #PBS -e /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stderr
> > > > > > > > WORKER_LOGGING_LEVEL=NONE
> > > > > > > > #PBS -v WORKER_LOGGING_LEVEL
> > > > > > > > cd / && aprun -n 3 -N 1 -cc none -d 24 -F exclusive /bin/sh -c /bin/date
> > > > > > > >
> > > > > > > >
> > > > > > > > =======
> > > > > > > >
> > > > > > > >
> > > > > > > > This fails too. The queue cancels the job as soon as it starts
> > > > > > > > running, without writing anything to stdout or stderr.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sun, Nov 13, 2011 at 12:54 AM, Michael Wilde < wilde at mcs.anl.gov > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > OK, I don't need these; I can reproduce the problem as well.
> > > > > > > >
> > > > > > > > For some reason, the coaster worker is exiting immediately.
> > > > > > > >
> > > > > > > > I see a few possibilities:
> > > > > > > >
> > > > > > > > - Beagle networking may have changed, making it no longer possible to
> > > > > > > > reach the coaster service from the compute nodes using the previous IP
> > > > > > > > address ranges.
> > > > > > > >
> > > > > > > > - the worker.pl script is not being created in
> > > > > > > > $HOME/.globus/coasters
> > > > > > > >
> > > > > > > > Mike
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ----- Original Message -----
> > > > > > > > > From: "Michael Wilde" < wilde at mcs.anl.gov >
> > > > > > > > > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > > > > > > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > > > > > > > Sent: Saturday, November 12, 2011 8:39:36 PM
> > > > > > > > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > > > > > > > Ketan, can you post the submit script and site file?
> > > > > > > > >
> > > > > > > > > On 11/12/11, Ketan Maheshwari < ketancmaheshwari at gmail.com > wrote:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > It seems the pbs-coaster provider (local:pbs) is broken for swift. I
> > > > > > > > > > tried swift trunk, the 0.93 svn branch, 0.93RC3 and 0.93RC4, but I am
> > > > > > > > > > getting the same response:
> > > > > > > > > >
> > > > > > > > > > Swift svn swift-r5205 cog-r3293
> > > > > > > > > >
> > > > > > > > > > RunID: 20111113-0216-1d35h7eb
> > > > > > > > > > Progress: time: Sun, 13 Nov 2011 02:16:54 +0000
> > > > > > > > > > site setting workersPerNode has been replaced with jobsPerNode!
> > > > > > > > > > Progress: time: Sun, 13 Nov 2011 02:17:05 +0000 Active:1
> > > > > > > > > > Failed to transfer wrapper log for job cat-1hg8aoik
> > > > > > > > > > Exception in cat:
> > > > > > > > > > Arguments: [data.txt]
> > > > > > > > > > Host: pbs
> > > > > > > > > > Directory: catsn-20111113-0216-1d35h7eb/jobs/1/cat-1hg8aoik
> > > > > > > > > > stderr.txt:
> > > > > > > > > >
> > > > > > > > > > stdout.txt:
> > > > > > > > > >
> > > > > > > > > > ----
> > > > > > > > > >
> > > > > > > > > > Caused by: Task failed: 1113-160254-000000 Block task ended prematurely
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Final status: time: Sun, 13 Nov 2011 02:17:05 +0000 Failed:1
> > > > > > > > > > The following errors have occurred:
> > > > > > > > > > 1. Task failed: 1113-160254-000000 Block task ended prematurely
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Trying the submit script outside of swift also does not seem to be
> > > > > > > > > > working. The script gets submitted to the queue and immediately exits
> > > > > > > > > > without writing anything to stdout or stderr.
> > > > > > > > > >
> > > > > > > > > > Were there any recent changes that could have affected this?
> > > > > > > > > >
> > > > > > > > > > I remember trying this successfully in the last week of last month.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > --
> > > > > > > > > > Ketan
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Sent from my mobile device
> > > > > > > > > _______________________________________________
> > > > > > > > > Swift-devel mailing list
> > > > > > > > > Swift-devel at ci.uchicago.edu
> > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> > > > > > > >
> > > > > > > > --
> > > > > > > > Michael Wilde
> > > > > > > > Computation Institute, University of Chicago
> > > > > > > > Mathematics and Computer Science Division
> > > > > > > > Argonne National Laboratory

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



