[Swift-devel] swift pbs/beagle broken

Michael Wilde wilde at mcs.anl.gov
Sun Nov 13 09:28:29 CST 2011


2 thoughts here, Ketan:

- When I tried my manual coaster test, I replaced the aprun options "-n 3 -N 1 -cc none -d 24 -F exclusive" with simply "-B", which tells aprun to use the options from qsub. I was going to go back and see whether there is some subtle new mismatch between the qsub and aprun processor-layout options.

- I realized that manually testing the Swift-generated submit file will give new errors, because the Swift service is no longer alive and listening on the port that the worker will try to connect to. Also, it seemed that the .pl file that the automatic coaster bootstrap places in ~/.globus/coasters was not there. I'm assuming that Swift removes these files when it exits, but I need to verify that this is true and that the failure is not due to a missing .pl file. I suspect that this is normal and is not the problem, but again, we need to keep debugging until the root cause is found.
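On the first point, a rough way to check for a layout mismatch is sketched below in Python. It assumes the usual Cray correspondence between the PBS resources and the aprun flags (mppwidth to -n, mppnppn to -N, mppdepth to -d) and compares only those three numbers; it does not look at -cc or -F. The function names are made up for illustration and are not part of Swift or CoG.

```python
import re
import shlex

# Assumed mapping between PBS mpp* resources and aprun flags on a Cray system.
PBS_TO_APRUN = {"mppwidth": "-n", "mppnppn": "-N", "mppdepth": "-d"}


def parse_pbs_layout(script_text):
    """Collect mpp* values from '#PBS -l key=value,...' directives."""
    layout = {}
    for line in script_text.splitlines():
        match = re.match(r"#PBS\s+-l\s+(\S+)", line.strip())
        if not match:
            continue
        for item in match.group(1).split(","):
            key, _, value = item.partition("=")
            if key in PBS_TO_APRUN:
                layout[key] = int(value)
    return layout


def parse_aprun_layout(script_text):
    """Collect -n, -N and -d values from the first aprun invocation found."""
    for line in script_text.splitlines():
        # comments=True drops the '#PBS' and '#CoG' lines entirely.
        tokens = shlex.split(line, comments=True)
        if "aprun" not in tokens:
            continue
        args = tokens[tokens.index("aprun") + 1:]
        layout = {}
        for flag, value in zip(args, args[1:]):
            if flag in PBS_TO_APRUN.values() and flag not in layout:
                layout[flag] = int(value)
        return layout
    return {}


def layout_mismatches(script_text):
    """Return (pbs_key, pbs_value, aprun_flag, aprun_value) per disagreement."""
    pbs = parse_pbs_layout(script_text)
    aprun = parse_aprun_layout(script_text)
    mismatches = []
    for key, flag in PBS_TO_APRUN.items():
        # Only compare settings present on both sides; a missing aprun flag
        # (for example when -B is used) is not treated as a mismatch here.
        if key in pbs and flag in aprun and pbs[key] != aprun[flag]:
            mismatches.append((key, pbs[key], flag, aprun[flag]))
    return mismatches


# Layout lines taken from the submit script quoted in this thread.
SUBMIT_SCRIPT = """\
#PBS -l mppwidth=3,mppnppn=1,mppdepth=24
#PBS -l walltime=00:10:00
cd / && aprun -n 3 -N 1 -cc none -d 24 -F exclusive /bin/sh -c /bin/date
"""

print(layout_mismatches(SUBMIT_SCRIPT))
```

For the script as posted this prints an empty list, so the three numeric settings agree; if the layout is the culprit, the mismatch would have to be in something this sketch does not cover, such as -cc or -F.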

Mike


----- Original Message -----
> From: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Sunday, November 13, 2011 7:20:25 AM
> Subject: Re: [Swift-devel] swift pbs/beagle broken
> I tried with a simple /bin/date command at the end of the submit
> script removing the call to worker.pl :
> 
> 
> 
> #CoG This script generated by CoG
> #CoG by class: class org.globus.cog.abstraction.impl.scheduler.pbs.PBSExecutor
> #CoG on date: 2011/11/13 02:16:54
> 
> 
> #PBS -S /bin/bash
> #PBS -N Block-1113-1602
> #PBS -m n
> #PBS -A CI-DEB000002
> #PBS -l mppwidth=3,mppnppn=1,mppdepth=24
> #PBS -l walltime=00:10:00
> #PBS -o /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stdout
> #PBS -e /home/ketan/.globus/scripts/PBS2583661693904024220.submit.stderr
> WORKER_LOGGING_LEVEL=NONE
> #PBS -v WORKER_LOGGING_LEVEL
> cd / && aprun -n 3 -N 1 -cc none -d 24 -F exclusive /bin/sh -c /bin/date
> 
> 
> =======
> 
> 
> This fails too. The queue cancels the job as soon as it starts
> running, without writing anything to stdout or stderr.
> 
> 
> 
> On Sun, Nov 13, 2011 at 12:54 AM, Michael Wilde < wilde at mcs.anl.gov >
> wrote:
> 
> 
> OK, I don't need these; I can reproduce the problem as well.
> 
> For some reason, the coaster worker is exiting immediately.
> 
> I see a few possibilities:
> 
> - Beagle networking may have changed, making it no longer possible to
> reach the coaster service from the compute nodes using the previous IP
> address ranges.
> 
> - the worker.pl script is not being created in $HOME/.globus/coasters
> 
> Mike
> 
> 
> 
> 
> 
> ----- Original Message -----
> > From: "Michael Wilde" < wilde at mcs.anl.gov >
> > To: "Ketan Maheshwari" < ketancmaheshwari at gmail.com >
> > Cc: "Swift Devel" < swift-devel at ci.uchicago.edu >
> > Sent: Saturday, November 12, 2011 8:39:36 PM
> > Subject: Re: [Swift-devel] swift pbs/beagle broken
> > Ketan, can you post the submit script and site file?
> >
> > On 11/12/11, Ketan Maheshwari < ketancmaheshwari at gmail.com > wrote:
> > > Hi,
> > >
> > > > It seems the pbs-coaster provider (local:pbs) is broken for Swift. I
> > > > tried Swift trunk, the 0.93 svn branch, 0.93RC3 and 0.93RC4, but I am
> > > > getting the same response:
> > >
> > > Swift svn swift-r5205 cog-r3293
> > >
> > > RunID: 20111113-0216-1d35h7eb
> > > Progress: time: Sun, 13 Nov 2011 02:16:54 +0000
> > > site setting workersPerNode has been replaced with jobsPerNode!
> > > Progress: time: Sun, 13 Nov 2011 02:17:05 +0000 Active:1
> > > Failed to transfer wrapper log for job cat-1hg8aoik
> > > Exception in cat:
> > > Arguments: [data.txt]
> > > Host: pbs
> > > Directory: catsn-20111113-0216-1d35h7eb/jobs/1/cat-1hg8aoik
> > > stderr.txt:
> > >
> > > stdout.txt:
> > >
> > > ----
> > >
> > > > Caused by: Task failed: 1113-160254-000000 Block task ended prematurely
> > >
> > >
> > > Final status: time: Sun, 13 Nov 2011 02:17:05 +0000 Failed:1
> > > The following errors have occurred:
> > > 1. Task failed: 1113-160254-000000 Block task ended prematurely
> > >
> > >
> > >
> > > > Trying the submit script outside of Swift does not seem to work
> > > > either. The script gets submitted to the queue and immediately exits
> > > > without writing anything to stdout or stderr.
> > >
> > > Were there any recent changes that could have affected this?
> > >
> > > > I remember trying this successfully in the last week of last month.
> > >
> > > Regards,
> > > --
> > > Ketan
> > >
> >
> > --
> > Sent from my mobile device
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> 
> 
> 
> 
> --
> Ketan

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



