[Swift-devel] Persistent coasters on OSG Swift not getting started cores

Sat Sep 10 09:30:34 CDT 2011

Mihael, I agree with your assessment.

Ketan, to enable worker logs: the run-worker.sh script tries to do this. You need to verify that it is correctly setting worker.pl to log. For efficiency it places the worker log on the worker's local tmp filesystem. The trick is getting the log file back via Condor-G. The current run-worker.sh script tails the worker log (and any other files that happen to get created in the log dir) to stdout so that Condor ships them back.

You should adjust run-worker.sh to ship the log file back in its entirety (or just increase the tail to something much larger).
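
A minimal sketch of the first option, assuming the log directory is known to the script. The function name ship_logs and the variable WORKER_LOGDIR are hypothetical and are not taken from the actual run-worker.sh:

```shell
#!/bin/sh
# Hypothetical helper; NOT the actual run-worker.sh code.
# Prints every regular file in a log directory to stdout in full,
# so that Condor returns the complete worker log with the job's stdout.
ship_logs() {
    logdir="$1"
    for f in "$logdir"/*; do
        [ -f "$f" ] || continue      # skip subdirectories and unmatched globs
        echo "=== BEGIN $f ==="
        cat "$f"                     # whole file, rather than only its tail
        echo "=== END $f ==="
    done
}

# Example call, assuming the log directory is held in WORKER_LOGDIR (hypothetical):
# ship_logs "$WORKER_LOGDIR"
```

The BEGIN/END markers are only there to make it easy to split the combined stdout back into separate files. If the full logs turn out to be too large, passing a much larger line count to tail (for example tail -n 100000) is the lighter-weight alternative.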

- Mike

----- Original Message -----
> From: "Mihael Hategan" <hategan at mcs.anl.gov>
> To: "Ketan Maheshwari" <ketancmaheshwari at gmail.com>
> Cc: "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Friday, September 9, 2011 9:38:34 PM
> Subject: Re: [Swift-devel] Persistent coasters on OSG Swift not getting started cores
> There seem to be lots of errors in that log, but a lot of them have to
> do with workers failing for unknown reasons.
> 
> This is no different than what you mentioned before. So we really need
> to troubleshoot that. So please enable worker logging and collect
> worker logs.
> 
> On Fri, 2011-09-09 at 11:52 -0500, Ketan Maheshwari wrote:
> > Hi Mihael, All,
> >
> >
> > I am trying to run the DSSAT workflow, a simple one-process
> > catsn-like loop.
> >
> >
> > The setup on OSG is based on persistent coasters, with the
> > following elements:
> >
> >
> > 1. A coaster service is started on the head node
> > 2. Workers are started on OSG sites. I am using 11 OSG sites.
> > 3. The workers are submitted in the form of Condor jobs which connect
> > back to the service running on the head node.
> > 4. In the current instance that I am running, 500 workers are
> > submitted to start, out of which 280 workers are in running state as
> > of now.
> >
> >
> > My throttles (jobthrottle and the foreach throttle) are set to run
> > 500 tasks at a time.
> >
> >
> > However, I am seeing a see-saw pattern of active tasks whose peak is
> > very low: the number of active tasks rises gradually from 0 to about
> > 30, then falls from 30 back to 0, and then climbs to 30 again.
> >
> >
> > The logs and sources are at:
> > http://ci.uchicago.edu/~ketan/DSSAT-logs.tgz
> >
> >
> > This tarball contains the following:
> >
> >
> > DSSAT-logs/sites.grid-ps.xml
> > DSSAT-logs/tc-provider-staging
> > DSSAT-logs/cf.ps
> > DSSAT-logs/RunDSSAT.swift
> >
> >
> > Condor and Swift logs:
> >
> >
> > DSSAT-logs/condor.log
> > DSSAT-logs/swift.log
> >
> >
> > Service and worker stdout:
> >
> >
> > DSSAT-logs/service-0.out
> > DSSAT-logs/swift-workers.out
> >
> >
> > Three runlogs since the run was resumed twice:
> >
> >
> > DSSAT-logs/RunDSSAT-20110909-1025-hjcelum9.log
> > DSSAT-logs/RunDSSAT-20110909-1030-jjefp0sb.log
> > DSSAT-logs/RunDSSAT-20110909-0918-0hk7ign5.log
> >
> >
> > Any insights would be helpful.
> >
> >
> > Regards,
> > --
> > Ketan
> >
> >
> >
> 
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory