[Swift-devel] Problems running Swift on BG/P

ZHAO ZHANG zhaozhang at uchicago.edu
Tue Feb 28 23:21:14 CST 2012


Hi, Mike, All,

Please refer to 
http://wiki.mcs.anl.gov/zeptoos/index.php/FAQ#How_to_open_a_socket_from_a_CN_to_the_outside_world 
for the NAT feature of ZeptoOS.
It can be enabled on the cqsub command line. Keep in mind that, if we 
use this feature, we have to start a server on the login node and let 
the compute nodes connect to its server socket. Once the server socket 
has accepted a connection, it can send messages back over it.
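
The connection direction described above can be sketched roughly as follows.
This is a minimal illustration, not Swift or ZeptoOS code: the function names,
the loopback address, and the message contents are all assumptions made for
the demo. On the real machine the listener would run on the login node and
the client on a compute node with NAT enabled.

```python
import socket
import threading


def serve_once(listener: socket.socket, prefix: bytes) -> None:
    """Login-node side: accept one inbound connection and answer on it."""
    conn, _addr = listener.accept()
    with conn:
        request = conn.recv(1024)        # message sent by the connecting node
        conn.sendall(prefix + request)   # reply goes back over the same socket


def connect_and_ask(host: str, port: int, message: bytes) -> bytes:
    """Compute-node side: open the connection outward, then read the reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(message)
        return sock.recv(1024)


def loopback_demo() -> bytes:
    """Run both sides on 127.0.0.1 (hypothetical stand-in for the two nodes)."""
    listener = socket.create_server(("127.0.0.1", 0))  # port 0: OS picks a free port
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener, b"ack: "))
    server.start()
    try:
        return connect_and_ask("127.0.0.1", port, b"hello from CN")
    finally:
        server.join()
        listener.close()


if __name__ == "__main__":
    print(loopback_demo())
```

The point of the sketch is that the server never initiates a connection
toward the compute node; it only replies on a socket that the compute node
opened.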

To access CNs from the I/O node, we need to use the tree network, whose 
addresses range from 192.168.1.1 to 192.168.1.64. There is an overlay 
mapping between the tree network and the torus network, but I never 
figured it out. We can work around the problem by logging in to one of 
the compute nodes and then telnetting to the torus network address.

A simple example: we could log in to 192.168.1.64. PS: at any scale, 
192.168.1.64 in the first pset is always the one with Rank 0. From 
there, we could log in to 12.0.0.2, and so on.
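
As a rough aid for the "no route to host" symptom, the tree-network range
mentioned above can be enumerated and probed from the I/O node. This is only
a sketch: the helper names are hypothetical, and it assumes the compute nodes
accept telnet on the standard port 23, which the thread does not confirm.

```python
import ipaddress
import socket


def tree_addresses(first: str = "192.168.1.1", count: int = 64) -> list[str]:
    """List the tree-network addresses 192.168.1.1 .. 192.168.1.64."""
    start = ipaddress.IPv4Address(first)
    return [str(start + offset) for offset in range(count)]


def can_connect(host: str, port: int = 23, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (port 23 is assumed)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, unreachable, and timed-out connections
        return False


if __name__ == "__main__":
    # Meant to be run on the I/O node; elsewhere every probe will likely fail.
    for address in tree_addresses():
        print(address, "reachable" if can_connect(address) else "unreachable")
```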

best
zhao

On 2/28/2012 11:07 PM, Michael Wilde wrote:
> Emalayan and I spent a considerable amount of time debugging Swift on surveyor tonight.
>
> As far as I can tell, after fixing a few config problems, it seems like the workers are unable to connect to the coaster service. They seem to be trying to connect to the correct address. The workers start and produce logs, but don't seem to make connections.
>
> I noticed the following email thread:
>    http://lists.ci.uchicago.edu/pipermail/swift-devel/2010-December/007099.html
>
> which talks about the sites attribute "alcfbgpnat" and states:
> ---
> This code snippet may be of relevance:
> if (settings.getAlcfbgpnat()) {
> 	spec.addEnvironmentVariable("ZOID_ENABLE_NAT", "true");
> }
>
> So you should set that env variable for the job if you want NAT.
> ---
>
> Is this being done in the current start-coaster-service job? (Presumably it needs to be done in the Cobalt job?)
>
> We also noticed that Emalayan was unable to follow the standard recipe for logging into the compute nodes of a running job. He could get to the IOP, but from there he got something like "no route to host" when he tried to telnet (or ping?) to the compute nodes.
>
> I'll check on the ZOID_ENABLE_NAT setting, but any thoughts?
>
> Thanks,
>
> - Mike
>


