[Swift-devel] Problems running Swift on BG/P

Michael Wilde wilde at mcs.anl.gov
Wed Feb 29 00:43:04 CST 2012


Thanks, Zhao.  In this case we are using start-coaster-service, which does start a service on the login nodes.  It's a procedure that has been tested and has worked for Justin.  But it's failing for Emalayan, and I think Jon just verified that it is failing for him as well.  This script does set ZOID_ENABLE_NAT via the cqsub -e option.

I've just verified, using a simple cqsub command modeled on what start-coaster-service uses, that with ZOID_ENABLE_NAT=true I am able to ping the login host, and with that variable not set, I cannot.  I also tested with that variable set in between two other variable settings, separated by colons as it is in start-coaster-service, and NAT still works:

/usr/bin/cqsub.py -q default -p MTCScienceApps -k zeptoos -t 60 -n 1 -C /home/wilde -E cobalt.17074.stderr -o cobalt.17074.stdout -e WORKER_LOGGING_LEVEL=debug:ZOID_ENABLE_NAT=true:WORKER_INIT_CMD=/home/wilde/bin/worker-init.pl /bin/ping -c 5 172.17.3.12
Command: '/bgsys/drivers/ppcfloor/bin/mpirun' '-host' '172.17.3.1' '-np' '1' '-partition' 'ANL-R00-M1-N02-64' '-mode' 'smp' '-cwd' '/home/wilde' '-exe' '/bin/ping' '-args' '-c 5 172.17.3.12' '-env' 'COBALT_JOBID=273236 WORKER_LOGGING_LEVEL=debug WORKER_INIT_CMD=/home/wilde/bin/worker-init.pl ZOID_ENABLE_NAT=true'

So the behavior we are seeing suggests that somehow, in Emalayan's tests, the ZOID_ENABLE_NAT setting is not getting through.
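
One way to check whether the setting is present on a captured cqsub command line is to parse the value passed to the -e option and look for the variable. The following is only a sketch: the function names are hypothetical, and the colon-separated NAME=value format is assumed from the command shown above, not taken from the Cobalt documentation. Values that themselves contain a colon would not be handled.

```python
import shlex


def parse_cqsub_env(command_line):
    """Return the NAME=value pairs passed via cqsub's -e option as a dict.

    Assumes the format seen in the command above: a single -e argument
    holding NAME=value pairs separated by ':'.
    """
    args = shlex.split(command_line)
    if "-e" not in args:
        return {}
    index = args.index("-e")
    if index + 1 >= len(args):
        return {}
    env = {}
    for pair in args[index + 1].split(":"):
        name, sep, value = pair.partition("=")
        if sep:  # skip fragments that have no '='
            env[name] = value
    return env


def nat_enabled(command_line):
    """True if the captured cqsub command sets ZOID_ENABLE_NAT=true."""
    return parse_cqsub_env(command_line).get("ZOID_ENABLE_NAT") == "true"


# The cqsub command from the test above, split across lines for readability.
captured = (
    "/usr/bin/cqsub.py -q default -p MTCScienceApps -k zeptoos -t 60 -n 1 "
    "-C /home/wilde -E cobalt.17074.stderr -o cobalt.17074.stdout "
    "-e WORKER_LOGGING_LEVEL=debug:ZOID_ENABLE_NAT=true:"
    "WORKER_INIT_CMD=/home/wilde/bin/worker-init.pl "
    "/bin/ping -c 5 172.17.3.12"
)
print(nat_enabled(captured))
```

Running the same check against the cqsub command that Emalayan's setup actually issues would show whether the variable is missing or misspelled before the job is submitted.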

Next I think we need to re-create the problem using the exact scripts, environment, conf, etc. that Emalayan is using, and then debug it from there, ideally capturing the cqsub command it uses and testing with just that to start with.

Jon said he will do this in the morning, and I think we can nail the problem then.

- Mike




----- Original Message -----
> From: "ZHAO ZHANG" <zhaozhang at uchicago.edu>
> To: "Michael Wilde" <wilde at mcs.anl.gov>
> Cc: "Justin M Wozniak" <wozniak at mcs.anl.gov>, "Jonathan Monette" <jon.monette at gmail.com>, emalayan at ece.ubc.ca, "Matei
> Ripeanu" <matei at ece.ubc.ca>, "Swift Devel" <swift-devel at ci.uchicago.edu>
> Sent: Tuesday, February 28, 2012 11:21:14 PM
> Subject: Re: [Swift-devel] Problems running Swift on BG/P
> Hi, Mike, All,
> 
> Please refer to
> http://wiki.mcs.anl.gov/zeptoos/index.php/FAQ#How_to_open_a_socket_from_a_CN_to_the_outside_world
> for the NAT feature of ZeptoOS.
> It can be enabled on the cqsub command line. Keep in mind that if we
> use this feature, we have to start a server on the login node and let
> the compute nodes connect to the server socket. Once the server socket
> gets the connection, it can send messages back.
> 
> To access CNs from the IO node, we need to use the tree network, which
> ranges from 192.168.1.1 to 192.168.1.64. There is an overlay mapping of
> the tree network and the torus network, but I never figured it out. We
> could work around the problem by logging in to one of the compute
> nodes, then telnetting to the torus network address.
> 
> A simple example: we could log in to 192.168.1.64. PS: at any scale,
> 192.168.1.68 in the first pset is always the one with Rank 0. From
> there, we could log in to 12.0.0.2, etc.
> 
> best
> zhao
> 
> On 2/28/2012 11:07 PM, Michael Wilde wrote:
> > Emalayan and I spent a considerable amount of time debugging Swift
> > on surveyor tonight.
> >
> > As far as I can tell, after fixing a few config problems, it seems
> > like the workers are unable to connect to the coaster service. They
> > seem to be trying to connect on the correct address. The workers
> > start, and produce logs, but don't seem to make connections.
> >
> > I noticed the following email thread:
> >    http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html
> >
> > which talks about the sites attribute "alcfbgpnat" and states:
> > ---
> > This code snippet may be of relevance:
> > if (settings.getAlcfbgpnat()) {
> > 	spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true");
> > }
> >
> > So you should set that env variable for the job if you want NAT.
> > ---
> >
> > Is this being done in the current start-coaster-service job?
> > (Presumably needs to be done in the cobalt job?)
> >
> > We also noticed that Emalayan was unable to follow the standard
> > recipe for logging into the compute nodes of a running job. He could
> > get to the IOP, but from there, got something like "no route to
> > host" when he tried to telnet (or ping?) to the compute nodes.
> >
> > I'll check on the ZOID_ENABLE_NAT setting, but any thoughts?
> >
> > Thanks,
> >
> > - Mike
> >

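The connection pattern Zhao describes above (a server on the login node, compute nodes connecting out to it, and the server replying over the established connection) can be illustrated with a small local sketch. This is not the coaster service code: the function names are hypothetical, and the loopback address stands in for the login node so the example can run anywhere.

```python
import socket
import threading


def serve_once(server_sock, seen):
    """Accept one worker connection, read its greeting, reply on the same socket."""
    conn, _addr = server_sock.accept()
    with conn:
        greeting = conn.recv(1024)
        seen.append(greeting)
        conn.sendall(b"ack:" + greeting)


def worker_connect(host, port, message):
    """Open an outbound connection (the direction NAT permits) and wait for the reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(message)
        return sock.recv(1024)


def demo():
    # Loopback stands in for the login node; port 0 lets the OS pick a free port.
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))
    server_sock.listen(1)
    host, port = server_sock.getsockname()
    seen = []
    thread = threading.Thread(target=serve_once, args=(server_sock, seen))
    thread.start()
    reply = worker_connect(host, port, b"hello")
    thread.join()
    server_sock.close()
    return seen, reply


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is the direction of the connection: the worker side always initiates, and the server only ever writes to a socket that a worker has already opened.
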
-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



