[Swift-devel] Re: Swift running errors

Michael Wilde wilde at mcs.anl.gov
Tue Feb 19 15:52:48 CST 2008


Xi,

Regarding the kickstart problem - this is just a warning, possibly due 
to an incorrect entry in your sites.xml file for the location where 
kickstart is installed.  We can look into this.

Regarding "too many open files" - it's possible that Swift is trying to 
run too much in parallel and thus opening too many files at once. 
Mihael or Ben, could this be due to missing or incorrectly set 
throttling parameters? I can't tell if this is hitting a per-host or 
per-process limit, but I suspect it's the latter.
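
One quick way to check the per-process side is to look at the open-file
limit in the shell you launch swift from. The following is a minimal
Python sketch, not part of Swift: it assumes a Unix-like system, the
/proc/self/fd count works only on Linux, and the function name is made
up for illustration. A JVM started from the same shell normally starts
from the same limits.

```python
import os
import resource


def open_file_headroom():
    """Return (soft_limit, hard_limit, fds_in_use) for this process.

    fds_in_use is None when /proc/self/fd is unavailable (non-Linux).
    """
    # RLIMIT_NOFILE is the per-process cap on open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Each entry in /proc/self/fd is one descriptor currently open.
        in_use = len(os.listdir("/proc/self/fd"))
    except OSError:
        in_use = None
    return soft, hard, in_use


if __name__ == "__main__":
    soft, hard, in_use = open_file_headroom()
    print("soft limit:", soft)
    print("hard limit:", hard)
    print("descriptors in use by this process:", in_use)
```

If the soft limit is small compared with the number of jobs running at
once, that would be consistent with a per-process limit being hit.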

Xi, until you hear from others, look at the throttling parameters and 
set them to a modest value to start with. I need to go back to my notes 
for this - and we should document this more clearly in the user guide.

- mike


On 2/19/08 3:00 PM, lixi at uchicago.edu wrote:
> Hi,
> 
> I have two problems. 
> 
> 1. Today, when I try to run a Swift workflow on multiple OSG 
> sites, I always encounter the following errors, which cause 
> the run to fail:
> [lixi at login remote]$ swift -tc.file /home/lixi/swift/test/tc.data \
>     -sites.file /home/lixi/swift/test/OSGEDU_Sites.xml \
>     workflowtest.swift 
> Swift v0.3-dev r1674 (modified locally)
> 
> RunID: 20080219-1447-1hztqje9
> node started
> Failed to transfer kickstart records from workflowtest-
> 20080219-1447-1hztqje9/kickstart/8/CIT_CMS_T2Exception in 
> getFile
>         task:transfer @ vdl-int.k, line: 322
>         sys:try @ vdl-int.k, line: 322
>         vdl:transferkickstartrec @ vdl-int.k, line: 409
>         sys:set @ vdl-int.k, line: 409
>         sys:sequential @ vdl-int.k, line: 409
>         sys:try @ vdl-int.k, line: 408
>         sys:else @ vdl-int.k, line: 407
>         sys:if @ vdl-int.k, line: 405
>         sys:set @ vdl-int.k, line: 404
>         sys:catch @ vdl-int.k, line: 396
>         sys:try @ vdl-int.k, line: 354
>         task:allocatehost @ vdl-int.k, line: 334
>         vdl:execute2 @ execute-default.k, line: 23
>         sys:restartonerror @ execute-default.k, line: 21
>         sys:sequential @ execute-default.k, line: 19
>         sys:try @ execute-default.k, line: 18
>         sys:if @ execute-default.k, line: 17
>         sys:then @ execute-default.k, line: 16
>         sys:if @ execute-default.k, line: 15
>         vdl:execute @ workflowtest.kml, line: 31
>         worknode @ workflowtest.kml, line: 79
>         sys:sequential @ workflowtest.kml, line: 78
>         sys:parallel @ workflowtest.kml, line: 77
>         vdl:mainp @ workflowtest.kml, line: 76
>         mainp @ vdl.k, line: 150
>         vdl:mains @ workflowtest.kml, line: 75
>         vdl:mains @ workflowtest.kml, line: 75
>         rlog:restartlog @ workflowtest.kml, line: 74
>         kernel:project @ workflowtest.kml, line: 2
>         workflowtest-20080219-1447-1hztqje9
> Caused by: 
> org.globus.cog.abstraction.impl.file.FileResourceException: 
> Exception in getFile
> Caused by: org.globus.ftp.exception.ServerException: Server 
> refused performing the request. Custom message:  (error code 
> 1) [Nested exception message:  Custom message: Unexpected 
> reply: 500-Command failed. : 
> globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
> 500-globus_l_gfs_file_open failed.
> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
> 500-globus_xio_register_open failed.
> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
> 500-Unable to open file /raid2/osg-data/lixi/workflowtest-
> 20080219-1447-1hztqje9/kickstart/8/node-8kgjdnoi-
> kickstart.xml
> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
> 500-System error in open: No such file or directory
> 500-globus_xio: A system call failed: No such file or 
> directory
> 500 End.] [Nested exception is 
> org.globus.ftp.exception.UnexpectedReplyCodeException:  
> Custom message: Unexpected reply: 500-Command failed. : 
> globus_gridftp_server_file.c:globus_l_gfs_file_send:2190:
> 500-globus_l_gfs_file_open failed.
> 500-globus_gridftp_server_file.c:globus_l_gfs_file_open:1694:
> 500-globus_xio_register_open failed.
> 500-globus_xio_file_driver.c:globus_l_xio_file_open:438:
> 500-Unable to open file /raid2/osg-data/lixi/workflowtest-
> 20080219-1447-1hztqje9/kickstart/8/node-8kgjdnoi-
> kickstart.xml
> 500-globus_xio_file_driver.c:globus_l_xio_file_open:381:
> 500-System error in open: No such file or directory
> 500-globus_xio: A system call failed: No such file or 
> directory
> 500 End.]
> 
> 2. When running a workflow which involves 1000 nodes, I 
> encounter the following errors very frequently, but not all 
> the time:
> ...
> node completed
> node completed
> node completed
> node completed
> node completed
> node failed
> Execution failed:
>         Exception in node:
> Arguments: [_concurrent/intermediatefile-b5b5dc39-df70-4137-
> 8149-c20f5d1af839-, out.0132.txt]
> Host: localhost
> Directory: workflowtest-20080219-1443-2qx4ctkc/jobs/6/node-
> 64kddnoi
> stderr.txt: 
> 
> stdout.txt: 
> 
> ----
> 
> Caused by:
>         java.io.IOException: Too many open files
> 
> Could you tell me why and teach me how to resolve such 
> problems? 
> 
> Thanks,
> 
> Xi
> 
> 


