[Swift-user] Swift on local resources

Andriy Fedorov fedorov at cs.wm.edu
Fri Jun 12 16:16:40 CDT 2009


On Fri, Jun 12, 2009 at 4:53 PM, Allan
Espinosa<aespinosa at cs.uchicago.edu> wrote:
> 2009/6/12 Andriy Fedorov <fedorov at cs.wm.edu>:
>> My ~/.ssh/auth.defaults is this:
>>
>> george.bwh.harvard.edu.type=key
>> george.bwh.harvard.edu.username=fedorov
>> george.bwh.harvard.edu.key=/home/fedorov/.ssh/identity.pub
>
> Is identity.pub your public key? This entry should refer to your private key.
>

Ah, ok ... Now I have this error, which makes more sense:

Swift 0.9 swift-r2860 cog-r2388

RunID: 20090612-1714-8od3dmb0
Progress:
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Progress:  Initializing site shared directory:1
Execution failed:
        Could not initialize shared directory on spl_george
Caused by:
        org.globus.cog.abstraction.impl.file.FileResourceException:
Error while communicating with the SSH server on
george.bwh.harvard.edu:22
Caused by:
        Public Key Authentication failed

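(For the archive: the fix was to point the key entry at the private key, i.e. the path without the .pub suffix — assuming the private half of identity.pub follows the OpenSSH default naming:

```
george.bwh.harvard.edu.key=/home/fedorov/.ssh/identity
```

which replaces the earlier NullPointerException with the genuine authentication failure shown above.)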


>> george.bwh.harvard.edu.passphrase=****
>>
>> But what I am saying is that passphrase login is not working even when
>> I do plain ssh -- it is asking me to enter the password.
>
> I see. Is your public key in the ~/.ssh/authorized_keys file on the remote host?
>>

Yes, I think I followed the instructions precisely.

Something is wrong with the system, because I was able to set up
passphrase access to the cluster head node, but not between the nodes
that share ~/.ssh (/home is NFS-mounted). Other people in the lab have
had the same difficulties with ssh keys, so I am afraid it's not
something obvious.
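For anyone hitting the same wall: with an NFS-mounted /home, a very common cause is sshd's StrictModes check, which rejects key authentication when the home directory, ~/.ssh, or authorized_keys is group- or world-writable. A minimal sketch of the usual fix (the function name is mine, a GNU/Linux host is assumed, and this may not be the problem here):

```shell
#!/bin/sh
# Sketch: tighten the permissions that sshd's StrictModes check inspects.
# sshd silently falls back to password auth if any of these are too open.
fix_ssh_perms() {
    home="$1"
    chmod go-w "$home"                      # home dir: no group/world write
    chmod 700 "$home/.ssh"                  # ~/.ssh: owner-only
    chmod 600 "$home/.ssh/authorized_keys"  # keys file: owner read/write only
}

# e.g.: fix_ssh_perms "$HOME"
```

If that does not help, the server-side log (/var/log/secure or /var/log/auth.log on the target node) usually states exactly why a key was refused.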



>> Here are the error messages I am getting trying to run a simple test:
>>
>> Swift 0.9 swift-r2860 cog-r2388
>>
>> RunID: 20090612-1643-tlksdhh5
>> Progress:
>> Progress:  Initializing site shared directory:1
>> Progress:  Initializing site shared directory:1
>> Progress:  Initializing site shared directory:1
>> Execution failed:
>>        Could not initialize shared directory on spl_george
>> Caused by:
>>        org.globus.cog.abstraction.impl.file.FileResourceException:
>> Error while communicating with the SSH server on
>> george.bwh.harvard.edu:22
>> Caused by:
>>        java.lang.NullPointerException
>> Caused by:
>>        java.lang.NullPointerException
>>        at java.lang.StringBuffer.<init>(StringBuffer.java:104)
>>        at com.sshtools.j2ssh.openssh.PEMReader.read(PEMReader.java:117)
>>        at com.sshtools.j2ssh.openssh.PEMReader.<init>(PEMReader.java:61)
>>        at com.sshtools.j2ssh.openssh.OpenSSHPrivateKeyFormat.isFormatted(OpenSSHPrivateKeyFormat.java:205)
>>        at com.sshtools.j2ssh.transport.publickey.SshPrivateKeyFile.parse(SshPrivateKeyFile.java:132)
>>        at com.sshtools.j2ssh.transport.publickey.SshPrivateKeyFile.parse(SshPrivateKeyFile.java:171)
>>        at org.globus.cog.abstraction.impl.ssh.Ssh.connect(Ssh.java:254)
>>        at org.globus.cog.abstraction.impl.ssh.SSHConnectionBundle$Connection.ensureConnected(SSHConnectionBundle.java:234)
>>        at org.globus.cog.abstraction.impl.ssh.SSHConnectionBundle.allocateChannel(SSHConnectionBundle.java:76)
>>        at org.globus.cog.abstraction.impl.ssh.SSHChannelManager.getChannel(SSHChannelManager.java:71)
>>        at org.globus.cog.abstraction.impl.ssh.file.FileResourceImpl.start(FileResourceImpl.java:81)
>>        at org.globus.cog.abstraction.impl.file.FileResourceCache.getResource(FileResourceCache.java:98)
>>        at org.globus.cog.abstraction.impl.file.CachingDelegatedFileOperationHandler.getResource(CachingDelegatedFileOperationHandler.java:75)
>>        at org.globus.cog.abstraction.impl.file.CachingDelegatedFileOperationHandler.submit(CachingDelegatedFileOperationHandler.java:40)
>>        at org.globus.cog.abstraction.impl.common.task.CachingFileOperationTaskHandler.submit(CachingFileOperationTaskHandler.java:28)
>>        at org.globus.cog.karajan.scheduler.submitQueue.NonBlockingSubmit.run(NonBlockingSubmit.java:86)
>>        at edu.emory.mathcs.backport.java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:431)
>>        at edu.emory.mathcs.backport.java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>        at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:643)
>>        at edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:668)
>>        at java.lang.Thread.run(Thread.java:595)
>>
>>
>> On Fri, Jun 12, 2009 at 4:17 PM, Andriy Fedorov<fedorov at cs.wm.edu> wrote:
>>> Allan,
>>>
>>> Thank you for the example.
>>>
>>> I have exactly the same setup, but it doesn't work for me. I suspect the
>>> reason is that I am unable to set up my environment to work with a
>>> passphrase. I can only log in with a password. I wonder if there is any
>>> workaround...
>>>
>>> AF
>>>
>>>
>>>
>>> On Fri, Jun 12, 2009 at 4:04 PM, Allan
>>> Espinosa<aespinosa at cs.uchicago.edu> wrote:
>>>> Hi Andriy and Mike,
>>>>
>>>> here is my example ~/.ssh/auth.defaults for executing jobs on
>>>> tp-login1.ci.uchicago.edu:
>>>>
>>>> [aespinosa at tp-login2 ~]$ cat .ssh/auth.defaults
>>>> tp-login1.ci.uchicago.edu.type=key
>>>> tp-login1.ci.uchicago.edu.username=aespinosa
>>>> tp-login1.ci.uchicago.edu.key=/home/aespinosa/.ssh/id_dsa
>>>> tp-login1.ci.uchicago.edu.passphrase=XXXXXXXX
>>>>
>>>> We have used Falkon and coasters before for multi-core configurations,
>>>> but for a single multi-core machine, I believe you can get away with
>>>> having multiple entries for the same host in the sites.xml file, using
>>>> the ssh provider.
>>>>
>>>> i.e.:
>>>>
>>>> <config>
>>>> <pool handle="CORE0">
>>>>   <execution provider="ssh"... />
>>>>   ...
>>>> </pool>
>>>> <pool handle="CORE1">
>>>>   ...
>>>> </pool>
>>>> </config>
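To spell out that sketch, a filled-in version might look like the following (the hostname and work directory are hypothetical, and the exact elements should be checked against the sites.xml documentation for your Swift release):

```xml
<config>
  <pool handle="CORE0">
    <execution provider="ssh" url="node.example.edu"/>
    <filesystem provider="ssh" url="node.example.edu"/>
    <workdirectory>/tmp/swiftwork</workdirectory>
  </pool>
  <pool handle="CORE1">
    <execution provider="ssh" url="node.example.edu"/>
    <filesystem provider="ssh" url="node.example.edu"/>
    <workdirectory>/tmp/swiftwork</workdirectory>
  </pool>
</config>
```

Swift's throttling settings then control how many jobs land on each pseudo-site at once.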
>>>>
>>>> 2009/6/12 Michael Wilde <wilde at mcs.anl.gov>:
>>>>> Ah, very cool. I'm eager to get more user-experience feedback on
>>>>> multicore use.
>>>>>
>>>>> So I will try to hunt down my examples of .ssh configs.
>>>>>
>>>>> Also, Allan Espinosa used this recently. Allan, can you post details and
>>>>> examples?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Mike
>>>>>
>>>>> On 6/12/09 2:06 PM, Andriy Fedorov wrote:
>>>>>>
>>>>>> Michael,
>>>>>>
>>>>>> Thank you for the advice, I will look into this. This is very helpful.
>>>>>> I had the impression that Lava is not included in the list of schedulers
>>>>>> supported out of the box, but I wanted to check.
>>>>>>
>>>>>> Just a clarification -- I need to access two different types of local
>>>>>> resources. The cluster (via Lava or Condor) is one, but for the multicore
>>>>>> nodes we have on the network, which are not part of the cluster, the only
>>>>>> option is to use ssh.
>>>>>>
>>>>>> AF
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 12, 2009 at 2:52 PM, Michael Wilde<wilde at mcs.anl.gov> wrote:
>>>>>>>
>>>>>>> Andriy,
>>>>>>>
>>>>>>> Ben or Mihael may have better ideas, but I offer my thoughts below.
>>>>>>>
>>>>>>> On 6/12/09 1:18 PM, Andriy Fedorov wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am trying to set up Swift with the local cluster and non-cluster
>>>>>>>> resources in our lab. Here are some configuration details.
>>>>>>>>
>>>>>>>> Due to technical problems, passphrase login is not possible for the
>>>>>>>> nodes on the local network, and I need to enter a password each time.
>>>>>>>>
>>>>>>>> For the cluster, I was able to set up passphrase login for the head
>>>>>>>> node. The cluster is running Lava and Condor schedulers at the same
>>>>>>>> time, but Lava should be used if possible.
>>>>>>>>
>>>>>>>> Two questions:
>>>>>>>>
>>>>>>>> (1) is it possible to configure Swift to talk to Lava scheduler?
>>>>>>>
>>>>>>> Making Swift talk to a new scheduler means writing a new CoG provider
>>>>>>> (in Java). You can likely use an existing "data" provider like "local";
>>>>>>> you could model the "execution" provider after the "PBS" provider. How
>>>>>>> hard this is depends on how close Lava is to PBS in nature (I don't
>>>>>>> know it), and the provider interface you need to code to is not well
>>>>>>> documented, AFAIK.
>>>>>>>
>>>>>>> I would try the Condor provider. While that provider is less mature
>>>>>>> and tested than others, it should work, and if it doesn't, we should
>>>>>>> try to fix it.
>>>>>>>
>>>>>>> If possible, make sure a simple condor_submit hello-world works for you
>>>>>>> first.
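For reference, a Condor hello-world submit description can be as small as this (the file names are arbitrary):

```
# hello.sub -- minimal vanilla-universe job
universe   = vanilla
executable = /bin/echo
arguments  = "hello world"
output     = hello.out
error      = hello.err
log        = hello.log
queue
```

Submit it with condor_submit hello.sub, watch it with condor_q, and confirm hello.out contains the expected output before pointing Swift at the pool.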
>>>>>>>
>>>>>>> Run swift on the head/login node; use the "local" data provider.
>>>>>>>
>>>>>>> Another route is to use Falkon, but that will be harder and it's less
>>>>>>> supported, so I suggest against it until easier routes are exhausted.
>>>>>>>
>>>>>>> I don't think that ssh will get you far: to leverage the cluster, I
>>>>>>> think you'd need to describe each worker node with a separate
>>>>>>> sites.xml entry. That's fine in principle, but a bit awkward, and it
>>>>>>> may have scheduling issues (i.e., if ssh hangs or dies when you don't
>>>>>>> own the node).
>>>>>>>
>>>>>>> Save ssh as another last resort; I suggest trying Condor first.
>>>>>>>
>>>>>>> If needed, people who used ssh recently can send you the info below.
>>>>>>>
>>>>>>> - Mike
>>>>>>>
>>>>>>>> (2) I am following the instructions on setting up ssh site provider to
>>>>>>>> use nodes on the local network.
>>>>>>>>  (2.1) do I need to set up auth.defaults even if I have ssh-agent
>>>>>>>> running, and can ssh to the remote node without being asked for
>>>>>>>> password?
>>>>>>>>  (2.2) can anybody give me more detailed instructions on how to set
>>>>>>>> up auth.defaults? I cannot make it work.
>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> Andriy Fedorov
>>>>>>>> _______________________________________________
>>>>
>>>>
>>>>
>>>> --
>>>> Allan M. Espinosa <http://allan.88-mph.net/blog>
>>>> PhD student, Computer Science
>>>> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
>>>>
>>>
>>
>>
>
>
>
> --
> Allan M. Espinosa <http://allan.88-mph.net/blog>
> PhD student, Computer Science
> University of Chicago <http://people.cs.uchicago.edu/~aespinosa>
>


