[Swift-devel] Problems running Swift on BG/P

Michael Wilde wilde at mcs.anl.gov
Tue Feb 28 23:18:56 CST 2012


I asked Emalayan to set GLOBUS_HOSTNAME to that value.

It's not being set in the sites file, but somehow that value is getting through (I think), because the workers are trying to connect to that address.

The sites file was:

<config>
  <pool handle="persistent-coasters">
    <execution provider="coaster-persistent"
               url="http://172.17.3.12:22356"
               jobmanager="local:local"/>
    <profile namespace="globus" key="workerManager">passive</profile>
    <profile namespace="globus" key="jobsPerNode">4</profile>
    <profile key="jobThrottle" namespace="karajan">1000</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <filesystem provider="local" url="none" />
    <workdirectory>/home/emalayan/work</workdirectory>
  </pool>
</config>

I also see that start-coaster-service is trying to set ZOID_ENABLE_NAT:

  ENV="WORKER_LOGGING_LEVEL=DEBUG:ZOID_ENABLE_NAT=true"
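  # Quote the variable: with an unquoted, empty $WORKER_ENVIRONMENT the
  # test is always true and a stray ":" is appended to ENV.
  if [ -n "$WORKER_ENVIRONMENT" ]; then
     ENV+=:$WORKER_ENVIRONMENT
  fi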
  set -x
  cqsub -q ${QUEUE}   \
        -k zeptoos    \
        -t ${MAXTIME} \
        -n ${NODES}   \
        -C ${PWD}/${LOG_DIR} \
        -E cobalt.${$}.stderr \
        -o cobalt.${$}.stdout \
        -e $ENV \
        $SWIFT_BIN/$WORKER $EXECUTION_URL $ID $PWD/$LOG_DIR

I'm thinking that one possibility is that, without NAT enabled, the workers can't connect back to the login host's 172. network, which is a different subnet than the 172. net the workers are on.

Jon, did this mechanism work for you?

Also, is it possible that somehow the ":"-separated envvars are not getting from cqsub to the job's environment? Could something have changed in cobalt in yesterday's maintenance window?
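
One rough way to check this would be to run a small diagnostic inside the job, before the worker starts, and look at what shows up in the job's stdout. The sketch below is only an illustration: report_env and names_in_envspec are hypothetical helper names, not part of Swift or Cobalt, and the spec string is just the one from the script above. It splits a ":"-separated spec into variable names and reports whether each one is present in the environment.

```shell
#!/bin/sh
# Hypothetical diagnostic sketch; not part of Swift or Cobalt.

# names_in_envspec "A=1:B=2" prints the variable names, one per line.
names_in_envspec() {
    printf '%s\n' "$1" | tr ':' '\n' | cut -d= -f1
}

# report_env NAME... prints "NAME=value" for each exported variable,
# or "NAME is unset" if it is missing from the environment.
report_env() {
    for name in "$@"; do
        if value=$(printenv "$name"); then
            printf '%s=%s\n' "$name" "$value"
        else
            printf '%s is unset\n' "$name"
        fi
    done
}

# The same spec string that start-coaster-service passes via "cqsub -e".
ENV_SPEC="WORKER_LOGGING_LEVEL=DEBUG:ZOID_ENABLE_NAT=true"

# Word splitting of the command substitution is intentional here:
# each name becomes a separate argument.
report_env $(names_in_envspec "$ENV_SPEC")
```

If ZOID_ENABLE_NAT is reported as unset when this runs inside the job, the variables are being lost somewhere between cqsub and the job's environment.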

- Mike

----- Original Message -----
> From: "Jonathan Monette" <jonmon at mcs.anl.gov>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Justin M Wozniak" <wozniak at mcs.anl.gov>, "Jonathan Monette" <jon.monette at gmail.com>, emalayan at ece.ubc.ca, "Matei
> Ripeanu" <matei at ece.ubc.ca>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Tuesday, February 28, 2012 11:09:28 PM
> Subject: Re: [Swift-devel] Problems running Swift on BG/P
> Is the internalHostname variable being set in the sites file? It
> should be set to the 172.*.* address returned from ifconfig.
> 
> On Feb 28, 2012, at 11:07 PM, Michael Wilde <wilde at mcs.anl.gov> wrote:
> 
> > Emalayan and I spent a considerable amount of time debugging Swift
> > on surveyor tonight.
> >
> > As far as I can tell, after fixing a few config problems, it seems
> > like the workers are unable to connect to the coaster service. They
> > seem to be trying to connect on the correct address. The workers
> > start and produce logs, but don't seem to make connections.
> >
> > I noticed the following email thread:
> >  http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html
> >
> > which talks about the sites attribute "alcfbgpnat" and states:
> > ---
> > This code snippet may be of relevance:
> > if (settings.getAlcfbgpnat()) {
> >    spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true");
> > }
> >
> > So you should set that env variable for the job if you want NAT.
> > ---
> >
> > Is this being done in the current start-coaster-service job?
> > (Presumably needs to be done in the cobalt job?)
> >
> > We also noticed that Emalayan was unable to follow the standard
> > recipe for logging into the compute nodes of a running job. He could
> > get to the IOP, but from there, got something like "no route to
> > host" when he tried to telnet (or ping?) to the compute nodes.
> >
> > I'll check on the ZOID_ENABLE_NAT setting, but any thoughts?
> >
> > Thanks,
> >
> > - Mike
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



