[Swift-user] Coaster Service Startup Time on OSDC

Matthew Shaxted Matthew.Shaxted at som.com
Wed Feb 19 12:50:47 CST 2014


I suppose the VM heap space issue on the cluster head node is due to the 32-bit Ubuntu OS...


From: Matthew Shaxted
Sent: Wednesday, February 19, 2014 12:38 PM
To: 'David Kelly'
Cc: Wilde, Michael J.; swift-user at ci.uchicago.edu
Subject: RE: [Swift-user] Coaster Service Startup Time on OSDC

Hi David,

I tried out the cluster and it does seem to make start-coaster-service much faster. However, I am getting VM heap space issues when starting Swift with the configuration I am using, which makes it difficult to use the cluster head node.

Also, I have many of my workflows set up to read from the glusterfs, so I may just stick with the slow coaster start for now. The bigger issue is that even when the coasters start, they do not remain persistent for some reason (could be my localhost coasters). Do I need to explicitly declare my localhost coasters as persistent in the sites.xml file?

I have been getting strange issues when I try to scale up my runs (10-12 hrs each) on the glusterfs configuration: they fail midway through for some reason, although smaller runs work, so I'm going to put some time into understanding why.

Thanks,
Matthew


MATTHEW SHAXTED

SKIDMORE, OWINGS & MERRILL LLP
224 South Michigan Ave.
Chicago, IL 60604
TEL: 312.360.4368
FAX: 312.360.4545
matthew.shaxted at som.com

WWW.SOM.COM<http://www.som.com/>

The information contained in this communication may be confidential, is intended only for the use of the recipient(s) named above, and may be legally privileged. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited and may be unlawful. If you have received this communication in error, please return it to the sender immediately and delete the original message and any copy of it from your computer system. If you have any questions concerning this message, please contact the sender.


From: David Kelly [mailto:davidkelly at uchicago.edu]
Sent: Monday, February 17, 2014 7:21 PM
To: Matthew Shaxted
Cc: Wilde, Michael J.; swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Coaster Service Startup Time on OSDC

Sounds good, Matthew. Let me know how that works for you. I filed a ticket with osdc support about the SSH issue, so hopefully they can offer some help there.

I added an entry to our site guide documentation about how to run on OSDC in cluster mode. It is at http://swiftlang.org/guides/release-0.94/siteguide/siteguide.html#_open_science_data_grid. The only potential issue is that it seems to work only with the standard Ubuntu images.

On Mon, Feb 17, 2014 at 3:34 PM, Matthew Shaxted <Matthew.Shaxted at som.com> wrote:
Hi David,

Indeed I am running start-coaster-service from the head node.

The first recommendation is a good one, I will try this out and let you know how it works.

I am also very interested in the cluster launch/PBS scheduler approach, although I have never used it before. An example OSDC/PBS config would be really helpful.

Thanks,
Matthew



From: Wilde, Michael J. [mailto:wilde at anl.gov]
Sent: Monday, February 17, 2014 2:36 PM
To: David Kelly; Matthew Shaxted
Cc: swift-user at ci.uchicago.edu
Subject: RE: [Swift-user] Coaster Service Startup Time on OSDC

Good find, David. Did you file a ticket on the slowness with OSDC Support?

When you run ssh -vvv, does the timing of log output suggest where the problem is?

Can you run it under some tool like typescript, or a "screen" log, that will timestamp the records, and send those to OSDC Support?
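
One way to get a timestamped record like the one suggested above is to pipe the verbose SSH output through a small shell loop. This is only a minimal sketch and was not posted in the thread: the function name stamp_lines is made up for illustration, and the user and host in the usage comment are placeholders.

```shell
# stamp_lines: prefix every line read from stdin with the wall-clock time
# at which it arrived. (Hypothetical helper name, for illustration only.)
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
    done
}

# Usage (ssh writes its -vvv debug output to stderr, hence 2>&1):
#   ssh -vvv user@worker-host ls 2>&1 | stamp_lines > ssh-debug.log
```

A long gap between two consecutive timestamps in the resulting log would point at the step of the connection setup where the delay occurs.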

- Mike
--
Michael Wilde
Mathematics and Computer Science          Computation Institute
Argonne National Laboratory                    The University of Chicago

________________________________
From: swift-user-bounces at ci.uchicago.edu [swift-user-bounces at ci.uchicago.edu] on behalf of David Kelly [davidkelly at uchicago.edu]
Sent: Monday, February 17, 2014 2:14 PM
To: Matthew Shaxted
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Coaster Service Startup Time on OSDC
Hi Matthew,

I set up a test on OSDC with 10 nodes. I did notice something strange there. When I try to SSH from the Sullivan head node to one of my CentOS instances, it takes much longer than it should:

dkelly@kg14-compute-1:~$ time ssh root@172.16.1.236 ls
Warning: Permanently added '172.16.1.236' (RSA) to the list of known hosts.
anaconda-ks.cfg
install.log
install.log.syslog

real 0m25.197s

Are you running start-coaster-service from the Sullivan head node? For each node in your node list, there are three SSH commands (one to create a directory structure, one to scp the worker.pl script there, and one to launch worker.pl). This is done serially in the 0.94 branch. I can see how this would take very long. I will make some changes to speed up start-coaster-service, but in the meantime, here are a few suggestions:

1. The SSH slowness seems to only be from the Sullivan head node to the VMs. SSH connections from one VM to another VM are pretty quick. Are you able to run Swift and start-coaster-service on a VM?

2. Is a persistent coasters setup needed here? OSDC has an option to launch instances as a cluster, which makes available a PBS scheduler. You could set this up and avoid the need to start and manage workers yourself.
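
To illustrate why the serial launch is slow and what a concurrent launch could look like, here is a minimal shell sketch. It is not the start-coaster-service code, nor the change planned for it: run_on_hosts and start_worker are hypothetical names, WORKER_DIR and the worker.pl arguments in the commented example are placeholders, and WORKER_HOSTS is assumed to hold a space-separated host list.

```shell
# run_on_hosts CMD HOST...
# Run "CMD HOST" once per host, all in the background, then wait for every
# one of them to finish. (Hypothetical helper, for illustration only.)
run_on_hosts() {
    cmd=$1
    shift
    for host in "$@"; do
        "$cmd" "$host" &
    done
    wait
}

# Example per-host command mirroring the three steps described above.
# WORKER_DIR and the worker.pl arguments are placeholders, not real values:
#
#   start_worker() {
#       ssh "$1" mkdir -p "$WORKER_DIR"              # 1. create directories
#       scp worker.pl "$1:$WORKER_DIR/"              # 2. copy the worker script
#       ssh "$1" "perl $WORKER_DIR/worker.pl ..."    # 3. launch the worker
#   }
#   run_on_hosts start_worker $WORKER_HOSTS
```

With a serial loop the total time is roughly the sum of the per-node times (60 nodes at about 30 seconds each is about 30 minutes). With all nodes started concurrently it should be closer to the time of the slowest node, assuming the head node can sustain that many simultaneous SSH connections.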

Let me know what you think. I have some example OSDC/PBS configs if you decide to go that route.

Thanks,
David

On Thu, Feb 13, 2014 at 1:04 PM, Matthew Shaxted <Matthew.Shaxted at som.com> wrote:
Sure thing David,

I have a setup.sh script that exports the WORKER_HOSTS IP addresses from a txt file (see attached). All nodes are running CentOS.

The conf file I am using is also attached. I'm using swift-0.94.1.

Many thanks,
Matthew



From: David Kelly [mailto:davidkelly at uchicago.edu]
Sent: Thursday, February 13, 2014 12:53 PM
To: Matthew Shaxted
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Coaster Service Startup Time on OSDC

Hi Matthew,

Could you please explain a little more about how you're starting the coaster-service and workers? Are you using the start-coaster-service script? If you are, could you please send the coaster-service.conf file you're using? Which version of Swift is this?

There may be some things we can do to speed up the process - just need to get a better understanding of how things are set up and where the delays are coming from. Thanks!

Regards,
David

On Thu, Feb 13, 2014 at 12:23 PM, Matthew Shaxted <Matthew.Shaxted at som.com> wrote:
Dear Swift User Group:

I am working with a 60-node cluster running on OSDC now, and when I try to start the Swift coaster-service for these nodes, it takes about 30 seconds (or more) per node to successfully start.

This is an issue for me as it limits how often I want to shut down the coaster-service - for this 60 node cluster it could take up to 30 min to start up again.

Is this coaster startup behavior normal? Is there anything I can do to make the coaster-service start faster?

Thanks,
Matthew




_______________________________________________
Swift-user mailing list
Swift-user at ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user


