[Swift-devel] swift pbs/beagle broken

Michael Wilde wilde at mcs.anl.gov
Sun Nov 13 09:45:05 CST 2011


But if you put an explicit output redirection in the /bin/sh -c command, you will see that those commands are indeed executing and generating output.
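
A minimal sketch of what such an explicit redirection could look like. The output path in OUT and the LAUNCHER variable are hypothetical and are not part of the swift-generated script. LAUNCHER is left empty so the sketch runs anywhere; inside a PBS job on Beagle it would presumably be set to the aprun prefix from the submit script.

```shell
#!/bin/sh
# Hypothetical output file; on Beagle this should be a path that the
# compute node can actually write to.
OUT="${OUT:-/tmp/redirect-test.$$.out}"

# Hypothetical launcher prefix. Empty by default so this runs on any host;
# inside a PBS job it could be set to e.g. "aprun -B".
LAUNCHER="${LAUNCHER:-}"

# Quote the whole command string so that the redirection is performed by the
# inner /bin/sh, not by the shell running the submit script.
$LAUNCHER /bin/sh -c "/bin/date > '$OUT' 2>&1"
echo "exit code: $?"

# If the command really ran, its output is in the file even when the
# qsub -o / -e files stay empty.
cat "$OUT"
```

If the file contains a date while the qsub stdout and stderr files are empty, the command is executing and the problem is on the -o / -e side.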

So, as I mentioned earlier, I don't know whether the qsub -o and -e flags have changed behavior (e.g., they can no longer write to /home?) or whether we are using them incorrectly.

But I think we need to work backwards and see why this is not working with the Swift-generated qsub files.

Next, we should add the following two profile tags to the sites file to obtain a log from the worker, on the (untested!) assumption that the worker really is starting in the automatic Swift case:

    <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile>
    <profile namespace="globus" key="workerLoggingDirectory">/lustre/beagle/wilde/beagle</profile>
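
Once those tags are in place, one way to check that assumption is to look for files that appear in the logging directory during a run. This is only a sketch: the marker file path and the list_new_logs helper are hypothetical, and no assumption is made about what the worker names its log files.

```shell
#!/bin/sh
# list_new_logs DIR MARKER: print regular files under DIR that were
# modified after the MARKER file was last touched.
list_new_logs() {
    find "$1" -type f -newer "$2" 2>/dev/null
}

# Directory from the workerLoggingDirectory profile above.
LOGDIR="${LOGDIR:-/lustre/beagle/wilde/beagle}"
# Hypothetical marker file: touch it just before starting swift.
MARKER="${MARKER:-/tmp/swift-run.marker}"

# Only report when both the log directory and the marker exist.
if [ -d "$LOGDIR" ] && [ -e "$MARKER" ]; then
    new=$(list_new_logs "$LOGDIR" "$MARKER")
    if [ -n "$new" ]; then
        echo "worker wrote logs, so it did start:"
        echo "$new"
    else
        echo "no new files in $LOGDIR; the worker probably never started"
    fi
fi
```

Usage would be: touch the marker file, run swift, then run this script.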

- Mike


----- Original Message -----
> From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Sunday, November 13, 2011 7:35:24 AM
> Subject: Re: [Swift-devel] swift pbs/beagle broken
> On Sun, Nov 13, 2011 at 9:28 AM, Michael Wilde < wilde at mcs.anl.gov >
> wrote:
> 
> 
> 2 thoughts here, Ketan:
> 
> - when I tried my manual coaster test, I replaced the options "-n 3 -N
> 1 -cc none -d 24 -F exclusive" on aprun with simply "-B" which says
> "use the options from qsub". I was going to go back and see if there
> was some subtle new mismatch between these qsub and aprun
> processor-layout options.
> 
> 
> 
> I tried the -B option:
> 
> 
> 
> #CoG This script generated by CoG
> #CoG by class: class org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor
> #CoG on date: 2011/11/13 02:16:54
> 
> 
> #PBS -S /bin/bash
> #PBS -N Block-1113-1602
> #PBS -m n
> #PBS -A CI-DEB000002
> #PBS -l mppwidth=3,mppnppn=1,mppdepth=24
> #PBS -l walltime=00:10:00
> #PBS -o /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stdout
> #PBS -e /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stderr
> WORKER_LOGGING_LEVEL=NONE
> #PBS -v WORKER_LOGGING_LEVEL
> cd / && aprun -B /bin/sh -c /bin/date
> /bin/echo $? >/home/ketan/.globus/scripts/PBS2583661693904024220.submit.exitcode
> 
> 
> And see the same behavior. The exitcode file is indeed updated each
> time with a code 0.
> 
> 
> 
> - I realized that manually testing the swift-generated submit file
> will give new errors because the swift service is no longer alive and
> listening on the port that the worker will try to connect to. Also, it
> seemed that the .pl file itself that automatic coaster bootstrap
> places in ~/.globus/coasters was not there. I'm assuming that Swift
> removes these files when it exits, but need to verify that this is
> true and that the failure is not due to a missing .pl file. I suspect
> that this is normal and is not the problem, but again, we need to keep
> debugging until the root cause is found.
> 
> 
> 
> Mike
> 
> 
> ----- Original Message -----
> 
> > From: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > To: "Michael Wilde" < wilde at mcs.anl.gov >
> > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> 
> 
> 
> > Sent: Sunday, November 13, 2011 7:20:25 AM
> > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > I tried with a simple /bin/date command at the end of the submit
> > script removing the call to worker.pl :
> >
> >
> >
> > #CoG This script generated by CoG
> > #CoG by class: class org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor
> > #CoG on date: 2011/11/13 02:16:54
> >
> >
> > #PBS -S /bin/bash
> > #PBS -N Block-1113-1602
> > #PBS -m n
> > #PBS -A CI-DEB000002
> > #PBS -l mppwidth=3,mppnppn=1,mppdepth=24
> > #PBS -l walltime=00:10:00
> > #PBS -o /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stdout
> > #PBS -e /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stderr
> > WORKER_LOGGING_LEVEL=NONE
> > #PBS -v WORKER_LOGGING_LEVEL
> > cd / && aprun -n 3 -N 1 -cc none -d 24 -F exclusive /bin/sh -c /bin/date
> >
> >
> > =======
> >
> >
> > This fails too. The queue cancels the job as soon as it starts
> > running, without writing anything to stdout or stderr.
> >
> >
> >
> > On Sun, Nov 13, 2011 at 12:54 AM, Michael Wilde < wilde at mcs.anl.gov
> > >
> > wrote:
> >
> >
> > OK, I don't need these; I can reproduce the problem as well.
> >
> > For some reason, the coaster worker is exiting immediately.
> >
> > I see a few possibilities:
> >
> > - Beagle networking may have changed, making it no longer possible to
> > reach the coaster service from the compute nodes using the previous IP
> > address ranges.
> >
> > - the worker.pl script is not being created in $HOME/.globus/coasters
> >
> > Mike
> >
> >
> >
> >
> >
> > ----- Original Message -----
> > > From: "Michael Wilde" < wilde at mcs.anl.gov >
> > > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > > Sent: Saturday, November 12, 2011 8:39:36 PM
> > > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > > Ketan, can you post the submit script and site file?
> > >
> > > On 11/12/11, Ketan Maheshwari < ketancmaheshwari at gmail.com >
> > > wrote:
> > > > Hi,
> > > >
> > > > It seems the pbs-coaster provider (local:pbs) is broken for swift.
> > > > I tried swift trunk, the 0.93 svn branch, 0.93RC3 and 0.93RC4, but
> > > > I get the same response each time:
> > > >
> > > > Swift svn swift-r5205 cog-r3293
> > > >
> > > > RunID: 20111113-0216-1d35h7eb
> > > > Progress: time: Sun, 13 Nov 2011 02:16:54 +0000
> > > > site setting workersPerNode has been replaced with jobsPerNode!
> > > > Progress: time: Sun, 13 Nov 2011 02:17:05 +0000 Active:1
> > > > Failed to transfer wrapper log for job cat-1hg8aoik
> > > > Exception in cat:
> > > > Arguments: [data.txt]
> > > > Host: pbs
> > > > Directory: catsn-20111113-0216-1d35h7eb/jobs/1/cat-1hg8aoik
> > > > stderr.txt:
> > > >
> > > > stdout.txt:
> > > >
> > > > ----
> > > >
> > > > Caused by: Task failed: 1113-160254-000000 Block task ended
> > > > prematurely
> > > >
> > > >
> > > > Final status: time: Sun, 13 Nov 2011 02:17:05 +0000 Failed:1
> > > > The following errors have occurred:
> > > > 1. Task failed: 1113-160254-000000 Block task ended prematurely
> > > >
> > > >
> > > >
> > > > Running the submit script outside of swift does not seem to work
> > > > either. The script is submitted to the queue and exits immediately
> > > > without writing anything to stdout or stderr.
> > > >
> > > > Were there any recent changes that could have affected this?
> > > >
> > > > I remember trying this successfully in the last week of last month.
> > > >
> > > > Regards,
> > > > --
> > > > Ketan
> > > >
> > >
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> >
> >
> >
> >
> > --
> > Ketan
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> 
> 
> 
> 
> --
> Ketan

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



