From cphillips at mcs.anl.gov Sat Sep 1 12:29:15 2012 From: cphillips at mcs.anl.gov (Carolyn Phillips) Date: Sat, 1 Sep 2012 12:29:15 -0500 Subject: [Swift-user] Output Files, ReadData and Order of Execution In-Reply-To: <1346462868.17915.0.camel@blabla> References: <12C4B0C3-95C6-42B4-BC95-B07588359E58@mcs.anl.gov> <1346458672.11980.1.camel@blabla> <1346460571.12165.0.camel@blabla> <98600E77-C847-4A5F-A2CD-6B134543E630@mcs.anl.gov> <1346462868.17915.0.camel@blabla> Message-ID: <16B1D248-C09F-4759-AA73-F3F50461D9F3@mcs.anl.gov> Yes, I tried that unlabeleddata pl = np.points; string parameters[] =readData(pl); and I got Execution failed: mypoints..dat (No such file or directory) On Aug 31, 2012, at 8:27 PM, Mihael Hategan wrote: > On Fri, 2012-08-31 at 20:11 -0500, Carolyn Phillips wrote: >> How would this line work for what I have below? >> >>>> string parameters[] =readData(np.points); >> > > unlabeleddata tmp = np.points; > string parameters[] = readData(tmp); > >> >> >> >> On Aug 31, 2012, at 7:49 PM, Mihael Hategan wrote: >> >>> Another bug. >>> >>> I committed a fix. In the mean time, the solution is: >>> >>> >>> errorlog fe = np.errorlog; >>> >>> int error = readData(fe); >>> >>> On Fri, 2012-08-31 at 19:29 -0500, Carolyn Phillips wrote: >>>> Hi Mihael, >>>> >>>> the reason I added the "@" was because >>>> >>>> now this (similar) line >>>> >>>> if(checkforerror==0) { >>>> string parameters[] =readData(np.points); >>>> } >>>> >>>> gives me this: >>>> >>>> Execution failed: >>>> mypoints..dat (No such file or directory) >>>> >>>> as in now its not getting the name of the file correct >>>> >>>> On Aug 31, 2012, at 7:17 PM, Mihael Hategan wrote: >>>> >>>>> @np.error means the file name of np.error which is known statically. So >>>>> readData(@np.error) can run as soon as the script starts. >>>>> >>>>> You probably want to say readData(np.error). >>>>> >>>>> Mihael >>>>> >>>>> >>>>> On Fri, 2012-08-31 at 18:55 -0500, Carolyn Phillips wrote: >>>>>> So I execute an atomic procedure to generate a datafile, and then next >>>>>> I want to do something with that data file. However, my program is >>>>>> trying to do something with the datafile before it has been written >>>>>> to. So something with order of execution is not working. I think the >>>>>> problem is that the name of my file exists, but the file itself does >>>>>> not yet, but execution proceeds anyway! >>>>>> >>>>>> Here are my lines >>>>>> >>>>>> type pointfile { >>>>>> unlabeleddata points; >>>>>> errorlog error; >>>>>> } >>>>>> >>>>>> # Generate Parameters >>>>>> pointfile np ; >>>>>> np = generatepoints(config,labeledpoints, "uniform", 50); >>>>>> >>>>>> int checkforerror = readData(@np.error); >>>>>> >>>>>> This gives an error : >>>>>> mypoints.error.dat (No such file or directory) >>>>>> >>>>>> If I comment out the last line.. all the files show up in the directory. (e.g. mypoints.points.dat and mypoints.error.dat) ) and if forget to remove the .dat files from a prior run, it also runs fine! >>>>>> >>>>>> How do you fix a problem like that? 
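A minimal sketch of the distinction described above, reusing the type, mapper and app declarations from this thread (the helper variable names are illustrative only):

    np = generatepoints(config, labeledpoints, "uniform", 50);

    # @np.error is just the mapped file name, a string that is known before
    # generatepoints runs, so this call has no data dependency and can fire
    # before the file exists:
    # int tooSoon = readData(@np.error);

    # Passing the file-typed value itself makes readData() wait for the app
    # that produces it; the temporary is only needed because of the
    # struct-member bug discussed in this thread:
    errorlog fe = np.error;
    int checkforerror = readData(fe);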
>>>>>> >>>>>> _______________________________________________ >>>>>> Swift-user mailing list >>>>>> Swift-user at ci.uchicago.edu >>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>> >>>>> >>>> >>> >>> >> > > From hategan at mcs.anl.gov Sat Sep 1 13:57:29 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 01 Sep 2012 11:57:29 -0700 Subject: [Swift-user] Output Files, ReadData and Order of Execution In-Reply-To: <16B1D248-C09F-4759-AA73-F3F50461D9F3@mcs.anl.gov> References: <12C4B0C3-95C6-42B4-BC95-B07588359E58@mcs.anl.gov> <1346458672.11980.1.camel@blabla> <1346460571.12165.0.camel@blabla> <98600E77-C847-4A5F-A2CD-6B134543E630@mcs.anl.gov> <1346462868.17915.0.camel@blabla> <16B1D248-C09F-4759-AA73-F3F50461D9F3@mcs.anl.gov> Message-ID: <1346525849.29086.0.camel@blabla> Can you post the entire script? On Sat, 2012-09-01 at 12:29 -0500, Carolyn Phillips wrote: > Yes, I tried that > > unlabeleddata pl = np.points; > string parameters[] =readData(pl); > > > and I got > > Execution failed: > mypoints..dat (No such file or directory) > > On Aug 31, 2012, at 8:27 PM, Mihael Hategan wrote: > > > On Fri, 2012-08-31 at 20:11 -0500, Carolyn Phillips wrote: > >> How would this line work for what I have below? > >> > >>>> string parameters[] =readData(np.points); > >> > > > > unlabeleddata tmp = np.points; > > string parameters[] = readData(tmp); > > > >> > >> > >> > >> On Aug 31, 2012, at 7:49 PM, Mihael Hategan wrote: > >> > >>> Another bug. > >>> > >>> I committed a fix. In the mean time, the solution is: > >>> > >>> > >>> errorlog fe = np.errorlog; > >>> > >>> int error = readData(fe); > >>> > >>> On Fri, 2012-08-31 at 19:29 -0500, Carolyn Phillips wrote: > >>>> Hi Mihael, > >>>> > >>>> the reason I added the "@" was because > >>>> > >>>> now this (similar) line > >>>> > >>>> if(checkforerror==0) { > >>>> string parameters[] =readData(np.points); > >>>> } > >>>> > >>>> gives me this: > >>>> > >>>> Execution failed: > >>>> mypoints..dat (No such file or directory) > >>>> > >>>> as in now its not getting the name of the file correct > >>>> > >>>> On Aug 31, 2012, at 7:17 PM, Mihael Hategan wrote: > >>>> > >>>>> @np.error means the file name of np.error which is known statically. So > >>>>> readData(@np.error) can run as soon as the script starts. > >>>>> > >>>>> You probably want to say readData(np.error). > >>>>> > >>>>> Mihael > >>>>> > >>>>> > >>>>> On Fri, 2012-08-31 at 18:55 -0500, Carolyn Phillips wrote: > >>>>>> So I execute an atomic procedure to generate a datafile, and then next > >>>>>> I want to do something with that data file. However, my program is > >>>>>> trying to do something with the datafile before it has been written > >>>>>> to. So something with order of execution is not working. I think the > >>>>>> problem is that the name of my file exists, but the file itself does > >>>>>> not yet, but execution proceeds anyway! > >>>>>> > >>>>>> Here are my lines > >>>>>> > >>>>>> type pointfile { > >>>>>> unlabeleddata points; > >>>>>> errorlog error; > >>>>>> } > >>>>>> > >>>>>> # Generate Parameters > >>>>>> pointfile np ; > >>>>>> np = generatepoints(config,labeledpoints, "uniform", 50); > >>>>>> > >>>>>> int checkforerror = readData(@np.error); > >>>>>> > >>>>>> This gives an error : > >>>>>> mypoints.error.dat (No such file or directory) > >>>>>> > >>>>>> If I comment out the last line.. all the files show up in the directory. (e.g. 
mypoints.points.dat and mypoints.error.dat) ) and if forget to remove the .dat files from a prior run, it also runs fine! > >>>>>> > >>>>>> How do you fix a problem like that? > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Swift-user mailing list > >>>>>> Swift-user at ci.uchicago.edu > >>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>> > >>>>> > >>>> > >>> > >>> > >> > > > > > From cphillips at mcs.anl.gov Sat Sep 1 15:23:56 2012 From: cphillips at mcs.anl.gov (Carolyn Phillips) Date: Sat, 1 Sep 2012 15:23:56 -0500 Subject: [Swift-user] Output Files, ReadData and Order of Execution In-Reply-To: <1346525849.29086.0.camel@blabla> References: <12C4B0C3-95C6-42B4-BC95-B07588359E58@mcs.anl.gov> <1346458672.11980.1.camel@blabla> <1346460571.12165.0.camel@blabla> <98600E77-C847-4A5F-A2CD-6B134543E630@mcs.anl.gov> <1346462868.17915.0.camel@blabla> <16B1D248-C09F-4759-AA73-F3F50461D9F3@mcs.anl.gov> <1346525849.29086.0.camel@blabla> Message-ID: <2113F690-B9FA-4DC9-962E-F7EFB5B9A696@mcs.anl.gov> Sure There are a lot of extra stuff running around in the script, fyi # Types type file; type unlabeleddata; type labeleddata; type errorlog; # Structured Types type pointfile { unlabeleddata points; errorlog error; } type simulationfile { file output; } # Apps app (file o) cat (file i) { cat @i stdout=@o; } app (file o) cat2 (file i) { systeminfo stdout=@o; } app (pointfile o) generatepoints (file c, labeleddata f, string mode, int Npoints) { matlab_callgeneratepoints @c @f mode Npoints @o.points @o.error; } #app (simulationfile o) runSimulation(string p) #{ # launchjob p @o.output; #} #Files (using single file mapper) file config <"designspace.config">; labeleddata labeledpoints <"emptypoints.dat">; type pointlog; # Loop iterate passes { # Generate Parameters pointfile np ; np = generatepoints(config,labeledpoints, "uniform", 50); int checkforerror = readData(np.error); tracef("%s: %i\n", "Generate Parameters Error Value", checkforerror); # Issue Jobs #simulationfile simfiles[] ; if(checkforerror==0) { unlabeleddata pl = np.points; string parameters[] =readData(pl); foreach p,pindex in parameters { tracef("Launch Job for Parameters: %s\n", p); #simfiles[pindex] = runSimulation(p); } } # Analyze Jobs # Generate Prediction # creates an array of datafiles named swifttest..out to write to file out[]; # creates a default of 10 files foreach j in [1:@toInt(@arg("n","10"))] { file data<"data.txt">; out[j] = cat2(data); } # try writing the iteration to a log file file passlog <"passes.log">; passlog = writeData(passes); # try reading from another log file int readpasses = readData(passlog); # Write to the Output Log tracef("%s: %i\n", "Iteration :", passes); tracef("%s: %i\n", "Iteration Read :", readpasses); #} until (readpasses == 2); # Determine if Done } until (passes == 1); # Determine if Done On Sep 1, 2012, at 1:57 PM, Mihael Hategan wrote: > Can you post the entire script? > > On Sat, 2012-09-01 at 12:29 -0500, Carolyn Phillips wrote: >> Yes, I tried that >> >> unlabeleddata pl = np.points; >> string parameters[] =readData(pl); >> >> >> and I got >> >> Execution failed: >> mypoints..dat (No such file or directory) >> >> On Aug 31, 2012, at 8:27 PM, Mihael Hategan wrote: >> >>> On Fri, 2012-08-31 at 20:11 -0500, Carolyn Phillips wrote: >>>> How would this line work for what I have below? 
>>>> >>>>>> string parameters[] =readData(np.points); >>>> >>> >>> unlabeleddata tmp = np.points; >>> string parameters[] = readData(tmp); >>> >>>> >>>> >>>> >>>> On Aug 31, 2012, at 7:49 PM, Mihael Hategan wrote: >>>> >>>>> Another bug. >>>>> >>>>> I committed a fix. In the mean time, the solution is: >>>>> >>>>> >>>>> errorlog fe = np.errorlog; >>>>> >>>>> int error = readData(fe); >>>>> >>>>> On Fri, 2012-08-31 at 19:29 -0500, Carolyn Phillips wrote: >>>>>> Hi Mihael, >>>>>> >>>>>> the reason I added the "@" was because >>>>>> >>>>>> now this (similar) line >>>>>> >>>>>> if(checkforerror==0) { >>>>>> string parameters[] =readData(np.points); >>>>>> } >>>>>> >>>>>> gives me this: >>>>>> >>>>>> Execution failed: >>>>>> mypoints..dat (No such file or directory) >>>>>> >>>>>> as in now its not getting the name of the file correct >>>>>> >>>>>> On Aug 31, 2012, at 7:17 PM, Mihael Hategan wrote: >>>>>> >>>>>>> @np.error means the file name of np.error which is known statically. So >>>>>>> readData(@np.error) can run as soon as the script starts. >>>>>>> >>>>>>> You probably want to say readData(np.error). >>>>>>> >>>>>>> Mihael >>>>>>> >>>>>>> >>>>>>> On Fri, 2012-08-31 at 18:55 -0500, Carolyn Phillips wrote: >>>>>>>> So I execute an atomic procedure to generate a datafile, and then next >>>>>>>> I want to do something with that data file. However, my program is >>>>>>>> trying to do something with the datafile before it has been written >>>>>>>> to. So something with order of execution is not working. I think the >>>>>>>> problem is that the name of my file exists, but the file itself does >>>>>>>> not yet, but execution proceeds anyway! >>>>>>>> >>>>>>>> Here are my lines >>>>>>>> >>>>>>>> type pointfile { >>>>>>>> unlabeleddata points; >>>>>>>> errorlog error; >>>>>>>> } >>>>>>>> >>>>>>>> # Generate Parameters >>>>>>>> pointfile np ; >>>>>>>> np = generatepoints(config,labeledpoints, "uniform", 50); >>>>>>>> >>>>>>>> int checkforerror = readData(@np.error); >>>>>>>> >>>>>>>> This gives an error : >>>>>>>> mypoints.error.dat (No such file or directory) >>>>>>>> >>>>>>>> If I comment out the last line.. all the files show up in the directory. (e.g. mypoints.points.dat and mypoints.error.dat) ) and if forget to remove the .dat files from a prior run, it also runs fine! >>>>>>>> >>>>>>>> How do you fix a problem like that? >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Swift-user mailing list >>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>> >>> >>> >> > > From hategan at mcs.anl.gov Sat Sep 1 18:45:33 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 01 Sep 2012 16:45:33 -0700 Subject: [Swift-user] Output Files, ReadData and Order of Execution In-Reply-To: <2113F690-B9FA-4DC9-962E-F7EFB5B9A696@mcs.anl.gov> References: <12C4B0C3-95C6-42B4-BC95-B07588359E58@mcs.anl.gov> <1346458672.11980.1.camel@blabla> <1346460571.12165.0.camel@blabla> <98600E77-C847-4A5F-A2CD-6B134543E630@mcs.anl.gov> <1346462868.17915.0.camel@blabla> <16B1D248-C09F-4759-AA73-F3F50461D9F3@mcs.anl.gov> <1346525849.29086.0.camel@blabla> <2113F690-B9FA-4DC9-962E-F7EFB5B9A696@mcs.anl.gov> Message-ID: <1346543133.1148.0.camel@blabla> The error comes from int checkforerror = readData(np.error); You have to use the workaround for both. 
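For reference, the workaround applied to both struct members looks roughly like this, assembled from the snippets in this thread (note that the field is named error, so the earlier np.errorlog line should read np.error):

    errorlog fe = np.error;
    int checkforerror = readData(fe);

    unlabeleddata pl = np.points;
    string parameters[] = readData(pl);

Once the committed fix is released, readData(np.error) and readData(np.points) should work directly, without the temporaries.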
On Sat, 2012-09-01 at 15:23 -0500, Carolyn Phillips wrote: > Sure > > There are a lot of extra stuff running around in the script, fyi > > # Types > type file; > type unlabeleddata; > type labeleddata; > type errorlog; > > # Structured Types > type pointfile { > unlabeleddata points; > errorlog error; > } > > type simulationfile { > file output; > } > > # Apps > app (file o) cat (file i) > { > cat @i stdout=@o; > } > > app (file o) cat2 (file i) > { > systeminfo stdout=@o; > } > > app (pointfile o) generatepoints (file c, labeleddata f, string mode, int Npoints) > { > matlab_callgeneratepoints @c @f mode Npoints @o.points @o.error; > } > > #app (simulationfile o) runSimulation(string p) > #{ > # launchjob p @o.output; > #} > > #Files (using single file mapper) > file config <"designspace.config">; > labeleddata labeledpoints <"emptypoints.dat">; > > type pointlog; > > # Loop > iterate passes { > > # Generate Parameters > pointfile np ; > np = generatepoints(config,labeledpoints, "uniform", 50); > > int checkforerror = readData(np.error); > tracef("%s: %i\n", "Generate Parameters Error Value", checkforerror); > > # Issue Jobs > #simulationfile simfiles[] ; > if(checkforerror==0) { > unlabeleddata pl = np.points; > string parameters[] =readData(pl); > foreach p,pindex in parameters { > tracef("Launch Job for Parameters: %s\n", p); > #simfiles[pindex] = runSimulation(p); > } > } > > # Analyze Jobs > > # Generate Prediction > > > > # creates an array of datafiles named swifttest..out to write to > file out[]; > > # creates a default of 10 files > foreach j in [1:@toInt(@arg("n","10"))] { > file data<"data.txt">; > out[j] = cat2(data); > } > > # try writing the iteration to a log file > file passlog <"passes.log">; > passlog = writeData(passes); > > # try reading from another log file > int readpasses = readData(passlog); > > # Write to the Output Log > tracef("%s: %i\n", "Iteration :", passes); > tracef("%s: %i\n", "Iteration Read :", readpasses); > > #} until (readpasses == 2); # Determine if Done > } until (passes == 1); # Determine if Done > > > On Sep 1, 2012, at 1:57 PM, Mihael Hategan wrote: > > > Can you post the entire script? > > > > On Sat, 2012-09-01 at 12:29 -0500, Carolyn Phillips wrote: > >> Yes, I tried that > >> > >> unlabeleddata pl = np.points; > >> string parameters[] =readData(pl); > >> > >> > >> and I got > >> > >> Execution failed: > >> mypoints..dat (No such file or directory) > >> > >> On Aug 31, 2012, at 8:27 PM, Mihael Hategan wrote: > >> > >>> On Fri, 2012-08-31 at 20:11 -0500, Carolyn Phillips wrote: > >>>> How would this line work for what I have below? > >>>> > >>>>>> string parameters[] =readData(np.points); > >>>> > >>> > >>> unlabeleddata tmp = np.points; > >>> string parameters[] = readData(tmp); > >>> > >>>> > >>>> > >>>> > >>>> On Aug 31, 2012, at 7:49 PM, Mihael Hategan wrote: > >>>> > >>>>> Another bug. > >>>>> > >>>>> I committed a fix. 
In the mean time, the solution is: > >>>>> > >>>>> > >>>>> errorlog fe = np.errorlog; > >>>>> > >>>>> int error = readData(fe); > >>>>> > >>>>> On Fri, 2012-08-31 at 19:29 -0500, Carolyn Phillips wrote: > >>>>>> Hi Mihael, > >>>>>> > >>>>>> the reason I added the "@" was because > >>>>>> > >>>>>> now this (similar) line > >>>>>> > >>>>>> if(checkforerror==0) { > >>>>>> string parameters[] =readData(np.points); > >>>>>> } > >>>>>> > >>>>>> gives me this: > >>>>>> > >>>>>> Execution failed: > >>>>>> mypoints..dat (No such file or directory) > >>>>>> > >>>>>> as in now its not getting the name of the file correct > >>>>>> > >>>>>> On Aug 31, 2012, at 7:17 PM, Mihael Hategan wrote: > >>>>>> > >>>>>>> @np.error means the file name of np.error which is known statically. So > >>>>>>> readData(@np.error) can run as soon as the script starts. > >>>>>>> > >>>>>>> You probably want to say readData(np.error). > >>>>>>> > >>>>>>> Mihael > >>>>>>> > >>>>>>> > >>>>>>> On Fri, 2012-08-31 at 18:55 -0500, Carolyn Phillips wrote: > >>>>>>>> So I execute an atomic procedure to generate a datafile, and then next > >>>>>>>> I want to do something with that data file. However, my program is > >>>>>>>> trying to do something with the datafile before it has been written > >>>>>>>> to. So something with order of execution is not working. I think the > >>>>>>>> problem is that the name of my file exists, but the file itself does > >>>>>>>> not yet, but execution proceeds anyway! > >>>>>>>> > >>>>>>>> Here are my lines > >>>>>>>> > >>>>>>>> type pointfile { > >>>>>>>> unlabeleddata points; > >>>>>>>> errorlog error; > >>>>>>>> } > >>>>>>>> > >>>>>>>> # Generate Parameters > >>>>>>>> pointfile np ; > >>>>>>>> np = generatepoints(config,labeledpoints, "uniform", 50); > >>>>>>>> > >>>>>>>> int checkforerror = readData(@np.error); > >>>>>>>> > >>>>>>>> This gives an error : > >>>>>>>> mypoints.error.dat (No such file or directory) > >>>>>>>> > >>>>>>>> If I comment out the last line.. all the files show up in the directory. (e.g. mypoints.points.dat and mypoints.error.dat) ) and if forget to remove the .dat files from a prior run, it also runs fine! > >>>>>>>> > >>>>>>>> How do you fix a problem like that? > >>>>>>>> > >>>>>>>> _______________________________________________ > >>>>>>>> Swift-user mailing list > >>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>>> > >>>> > >>> > >>> > >> > > > > > From cphillips at mcs.anl.gov Sun Sep 2 22:16:05 2012 From: cphillips at mcs.anl.gov (Carolyn Phillips) Date: Sun, 2 Sep 2012 21:16:05 -0600 Subject: [Swift-user] Output Files, ReadData and Order of Execution In-Reply-To: <1346543133.1148.0.camel@blabla> References: <12C4B0C3-95C6-42B4-BC95-B07588359E58@mcs.anl.gov> <1346458672.11980.1.camel@blabla> <1346460571.12165.0.camel@blabla> <98600E77-C847-4A5F-A2CD-6B134543E630@mcs.anl.gov> <1346462868.17915.0.camel@blabla> <16B1D248-C09F-4759-AA73-F3F50461D9F3@mcs.anl.gov> <1346525849.29086.0.camel@blabla> <2113F690-B9FA-4DC9-962E-F7EFB5B9A696@mcs.anl.gov> <1346543133.1148.0.camel@blabla> Message-ID: You are right. That was my problem. (dumb!) Anyway. My next issue is that Swift is telling me it can't find a file that exists. Perhaps this is because it does not understand absolute directory paths the way I am specifying them? 
The short version is that, using a simple mapper, I specify the location as ;location="//scratch/midway/phillicl/SwiftJobs/" Note that I had to put two // at the beginning because the first backslash gets removed for some reason. Then I pass that file name to a script write to that file. But then Swift doesn't see the file Here is the more detailed version I have a script called launch jobs that does the following > cd /scratch/midway/phillicl/SwiftJobs > > mkdir Job.${1}.${2} > cd Job.${1}.${2} > > # Copy in some files and do some work > > pwd > ${3} Here is the swift script > # Types > type file; > type unlabeleddata; > type labeleddata; > type errorlog; > > # Structured Types > type pointfile { > unlabeleddata points; > errorlog error; > } > > type simulationfile { > file output; > } > > # Apps > app (file o) cat (file i) > { > cat @i stdout=@o; > } > > app (file o) cat2 (file i) > { > systeminfo stdout=@o; > } > > app (pointfile o) generatepoints (file c, labeleddata f, string mode, int Npoints) > { > matlab_callgeneratepoints @c @f mode Npoints @o.points @o.error; > } > > app (simulationfile o) runSimulation(string p,int passes, int pindex) > { > launchjob passes pindex @o.output; > } > > #Files (using single file mapper) > file config <"designspace.config">; > labeleddata labeledpoints <"emptypoints.dat">; > > type pointlog; > > # Loop > iterate passes { > > # Generate Parameters > pointfile np ; > np = generatepoints(config,labeledpoints, "uniform", 50); > > errorlog fe = np.error; > int checkforerror = readData(fe); > tracef("%s: %i\n", "Generate Parameters Error Value", checkforerror); > > # Issue Jobs > simulationfile simfiles[] ; > if(checkforerror==0) { > unlabeleddata pl = np.points; > string parameters[] =readData(pl); > foreach p,pindex in parameters { > tracef("Launch Job for Parameters: %s\n", p); > simfiles[pindex] = runSimulation(p,passes,pindex); > } > } > > # Analyze Jobs > > # Generate Prediction > > > > # creates an array of datafiles named swifttest..out to write to > file out[]; > > # creates a default of 10 files > foreach j in [1:@toInt(@arg("n","10"))] { > file data<"data.txt">; > out[j] = cat2(data); > } > > # try writing the iteration to a log file > file passlog <"passes.log">; > passlog = writeData(passes); > > # try reading from another log file > int readpasses = readData(passlog); > > # Write to the Output Log > tracef("%s: %i\n", "Iteration :", passes); > tracef("%s: %i\n", "Iteration Read :", readpasses); > > #} until (readpasses == 2); # Determine if Done > } until (passes == 1); # Determine if Done And Here is the error I get: EXCEPTION Exception in launchjob: Arguments: [0, 1, /scratch/midway/phillicl/SwiftJobs/0.0001.output.job] Host: pbs Directory: test-20120903-0303-pedfpqu8/jobs/f/launchjob-fff1mjxk stderr.txt: stdout.txt: ---- sys:exception @ vdl-int.k, line: 601 sys:throw @ vdl-int.k, line: 600 sys:catch @ vdl-int.k, line: 567 sys:try @ vdl-int.k, line: 469 task:allocatehost @ vdl-int.k, line: 419 vdl:execute2 @ execute-default.k, line: 23 sys:ignoreerrors @ execute-default.k, line: 21 sys:parallelfor @ execute-default.k, line: 20 sys:restartonerror @ execute-default.k, line: 16 sys:sequential @ execute-default.k, line: 14 sys:try @ execute-default.k, line: 13 sys:if @ execute-default.k, line: 12 sys:then @ execute-default.k, line: 11 sys:if @ execute-default.k, line: 10 vdl:execute @ test.kml, line: 182 run_simulation @ test.kml, line: 480 sys:parallel @ test.kml, line: 465 foreach @ test.kml, line: 456 sys:parallel @ test.kml, line: 427 
sys:then @ test.kml, line: 409 sys:if @ test.kml, line: 404 sys:sequential @ test.kml, line: 402 sys:parallel @ test.kml, line: 315 iterate @ test.kml, line: 229 vdl:sequentialwithid @ test.kml, line: 226 vdl:mainp @ test.kml, line: 225 mainp @ vdl.k, line: 118 vdl:mains @ test.kml, line: 223 vdl:mains @ test.kml, line: 223 rlog:restartlog @ test.kml, line: 222 kernel:project @ test.kml, line: 2 test-20120903-0303-pedfpqu8 Caused by: The following output files were not created by the application: /scratch/midway/phillicl/SwiftJobs/0.0001.output.job Note that > ls /scratch/midway/phillicl/SwiftJobs/0.0001.output.job /scratch/midway/phillicl/SwiftJobs/0.0001.output.job On Sep 1, 2012, at 5:45 PM, Mihael Hategan wrote: > The error comes from int checkforerror = readData(np.error); > > You have to use the workaround for both. > > On Sat, 2012-09-01 at 15:23 -0500, Carolyn Phillips wrote: >> Sure >> >> There are a lot of extra stuff running around in the script, fyi >> >> # Types >> type file; >> type unlabeleddata; >> type labeleddata; >> type errorlog; >> >> # Structured Types >> type pointfile { >> unlabeleddata points; >> errorlog error; >> } >> >> type simulationfile { >> file output; >> } >> >> # Apps >> app (file o) cat (file i) >> { >> cat @i stdout=@o; >> } >> >> app (file o) cat2 (file i) >> { >> systeminfo stdout=@o; >> } >> >> app (pointfile o) generatepoints (file c, labeleddata f, string mode, int Npoints) >> { >> matlab_callgeneratepoints @c @f mode Npoints @o.points @o.error; >> } >> >> #app (simulationfile o) runSimulation(string p) >> #{ >> # launchjob p @o.output; >> #} >> >> #Files (using single file mapper) >> file config <"designspace.config">; >> labeleddata labeledpoints <"emptypoints.dat">; >> >> type pointlog; >> >> # Loop >> iterate passes { >> >> # Generate Parameters >> pointfile np ; >> np = generatepoints(config,labeledpoints, "uniform", 50); >> >> int checkforerror = readData(np.error); >> tracef("%s: %i\n", "Generate Parameters Error Value", checkforerror); >> >> # Issue Jobs >> #simulationfile simfiles[] ; >> if(checkforerror==0) { >> unlabeleddata pl = np.points; >> string parameters[] =readData(pl); >> foreach p,pindex in parameters { >> tracef("Launch Job for Parameters: %s\n", p); >> #simfiles[pindex] = runSimulation(p); >> } >> } >> >> # Analyze Jobs >> >> # Generate Prediction >> >> >> >> # creates an array of datafiles named swifttest..out to write to >> file out[]; >> >> # creates a default of 10 files >> foreach j in [1:@toInt(@arg("n","10"))] { >> file data<"data.txt">; >> out[j] = cat2(data); >> } >> >> # try writing the iteration to a log file >> file passlog <"passes.log">; >> passlog = writeData(passes); >> >> # try reading from another log file >> int readpasses = readData(passlog); >> >> # Write to the Output Log >> tracef("%s: %i\n", "Iteration :", passes); >> tracef("%s: %i\n", "Iteration Read :", readpasses); >> >> #} until (readpasses == 2); # Determine if Done >> } until (passes == 1); # Determine if Done >> >> >> On Sep 1, 2012, at 1:57 PM, Mihael Hategan wrote: >> >>> Can you post the entire script? 
>>> >>> On Sat, 2012-09-01 at 12:29 -0500, Carolyn Phillips wrote: >>>> Yes, I tried that >>>> >>>> unlabeleddata pl = np.points; >>>> string parameters[] =readData(pl); >>>> >>>> >>>> and I got >>>> >>>> Execution failed: >>>> mypoints..dat (No such file or directory) >>>> >>>> On Aug 31, 2012, at 8:27 PM, Mihael Hategan wrote: >>>> >>>>> On Fri, 2012-08-31 at 20:11 -0500, Carolyn Phillips wrote: >>>>>> How would this line work for what I have below? >>>>>> >>>>>>>> string parameters[] =readData(np.points); >>>>>> >>>>> >>>>> unlabeleddata tmp = np.points; >>>>> string parameters[] = readData(tmp); >>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Aug 31, 2012, at 7:49 PM, Mihael Hategan wrote: >>>>>> >>>>>>> Another bug. >>>>>>> >>>>>>> I committed a fix. In the mean time, the solution is: >>>>>>> >>>>>>> >>>>>>> errorlog fe = np.errorlog; >>>>>>> >>>>>>> int error = readData(fe); >>>>>>> >>>>>>> On Fri, 2012-08-31 at 19:29 -0500, Carolyn Phillips wrote: >>>>>>>> Hi Mihael, >>>>>>>> >>>>>>>> the reason I added the "@" was because >>>>>>>> >>>>>>>> now this (similar) line >>>>>>>> >>>>>>>> if(checkforerror==0) { >>>>>>>> string parameters[] =readData(np.points); >>>>>>>> } >>>>>>>> >>>>>>>> gives me this: >>>>>>>> >>>>>>>> Execution failed: >>>>>>>> mypoints..dat (No such file or directory) >>>>>>>> >>>>>>>> as in now its not getting the name of the file correct >>>>>>>> >>>>>>>> On Aug 31, 2012, at 7:17 PM, Mihael Hategan wrote: >>>>>>>> >>>>>>>>> @np.error means the file name of np.error which is known statically. So >>>>>>>>> readData(@np.error) can run as soon as the script starts. >>>>>>>>> >>>>>>>>> You probably want to say readData(np.error). >>>>>>>>> >>>>>>>>> Mihael >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, 2012-08-31 at 18:55 -0500, Carolyn Phillips wrote: >>>>>>>>>> So I execute an atomic procedure to generate a datafile, and then next >>>>>>>>>> I want to do something with that data file. However, my program is >>>>>>>>>> trying to do something with the datafile before it has been written >>>>>>>>>> to. So something with order of execution is not working. I think the >>>>>>>>>> problem is that the name of my file exists, but the file itself does >>>>>>>>>> not yet, but execution proceeds anyway! >>>>>>>>>> >>>>>>>>>> Here are my lines >>>>>>>>>> >>>>>>>>>> type pointfile { >>>>>>>>>> unlabeleddata points; >>>>>>>>>> errorlog error; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> # Generate Parameters >>>>>>>>>> pointfile np ; >>>>>>>>>> np = generatepoints(config,labeledpoints, "uniform", 50); >>>>>>>>>> >>>>>>>>>> int checkforerror = readData(@np.error); >>>>>>>>>> >>>>>>>>>> This gives an error : >>>>>>>>>> mypoints.error.dat (No such file or directory) >>>>>>>>>> >>>>>>>>>> If I comment out the last line.. all the files show up in the directory. (e.g. mypoints.points.dat and mypoints.error.dat) ) and if forget to remove the .dat files from a prior run, it also runs fine! >>>>>>>>>> >>>>>>>>>> How do you fix a problem like that? 
>>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Swift-user mailing list >>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>> >>> >>> >> > > From davidk at ci.uchicago.edu Sun Sep 2 23:17:40 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Sun, 2 Sep 2012 23:17:40 -0500 (CDT) Subject: [Swift-user] Concatenating members of a string array In-Reply-To: <1313924086.15693.1346184264019.JavaMail.root@zimbra.anl.gov> Message-ID: <243129305.58762.1346645860026.JavaMail.root@zimbra-mb2.anl.gov> Sheri, This should be in trunk now. There is an example at http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_strjoin. Please let me know if you have any questions. Thanks, David ----- Original Message ----- > From: "Michael Wilde" > To: "Sheri Mickelson" > Cc: "swift user" > Sent: Tuesday, August 28, 2012 3:04:24 PM > Subject: [Swift-user] Concatenating members of a string array > Sheri, in response to your off-list question: here are two ways to > concatenate a string array into a single string. We should provide a > primitive to do this better / more reliably. > > The code below is a quick hack and needs polishing for the User Guide. > > - Mike > > > string sa[] = ["aaa","bbb","ccc","ddd","eee"]; > > string cs[]; > > cs[-1] = ""; > > foreach e,i in sa { > cs[i] = @strcat(cs[i-1], " ", e); > } > > trace("cat=",cs[@length(sa)-1]); > > type file; > > app (file o) echo (string s[]) > { > echo s stdout=@filename(o); > } > > file f = echo(sa); > > string os = readData(f); > > trace("os=",os); > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user From wilde at mcs.anl.gov Wed Sep 5 08:40:28 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Wed, 5 Sep 2012 08:40:28 -0500 (CDT) Subject: [Swift-user] Output Files, ReadData and Order of Execution In-Reply-To: Message-ID: <546410676.309.1346852428453.JavaMail.root@zimbra.anl.gov> Hi Carolyn, I think this error is due to the fact that the launchjob script is not coded to match Swift's file management conventions. Unless you declare that you want to use "Direct" file management, Swift will expect output files to be created relative to the directory in which it runs your app() scripts. That's why it was pulling off the leading "/". By putting a // at the front of the pathname, you were inadvertently causing your launchjob script to place its output file in a different directory than where Swift was expecting it. There's a further mismatch I think between the mapped filename from simple_mapper (which defaults to 4-digit strings for indices) and the names that launchjob is trying to create. I'll need to send further clarification later, but for now, could you try the following: - go back to using a single leading "/" - comment out the mkdir and cd in launchjob, as $3 contains the correct pathname to write to (which will be a long relative pathname without the leading "/") I think what you really want here is "DIRECT" file management mode, explained at: http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_policy_descriptions We need to enhance the User Guide to explain this clearly and fully. 
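To make the advice above concrete, here is a rough sketch of the default (non-direct) convention; the simple_mapper spec is a guess, since the original mapper declarations do not survive in the archive:

    # Mapped without location=, so @o.output inside runSimulation() resolves to a
    # relative path under the run directory (e.g. 0.0001.output.job, with the
    # 4-digit index supplied by simple_mapper), and Swift can find and stage the
    # result back itself.
    simulationfile simfiles[] <simple_mapper; prefix=@strcat(passes, "."), suffix=".output.job">;
    foreach p, pindex in parameters {
        simfiles[pindex] = runSimulation(p, passes, pindex);
    }

With this layout, launchjob just writes to the path it is handed in $3, with no leading cd or mkdir. The alternative, shown in the follow-up message below, is to keep the absolute location= and enable CDM "direct" mode (a rule file containing rule .* DIRECT /), so Swift uses the files in place instead of staging them.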
- Mike ----- Original Message ----- > From: "Carolyn Phillips" > To: "Mihael Hategan" > Cc: swift-user at ci.uchicago.edu > Sent: Sunday, September 2, 2012 10:16:05 PM > Subject: Re: [Swift-user] Output Files, ReadData and Order of Execution > You are right. That was my problem. (dumb!) > > Anyway. My next issue is that Swift is telling me it can't find a file > that exists. Perhaps this is because it does not understand absolute > directory paths the way I am specifying them? > > The short version is that, using a simple mapper, I specify the > location as ;location="//scratch/midway/phillicl/SwiftJobs/" Note that > I had to put two // at the beginning because the first backslash gets > removed for some reason. Then I pass that file name to a script write > to that file. But then Swift doesn't see the file > > Here is the more detailed version > > I have a script called launch jobs that does the following > > cd /scratch/midway/phillicl/SwiftJobs > > > > mkdir Job.${1}.${2} > > cd Job.${1}.${2} > > > > # Copy in some files and do some work > > > > pwd > ${3} > > > > Here is the swift script > > > # Types > > type file; > > type unlabeleddata; > > type labeleddata; > > type errorlog; > > > > # Structured Types > > type pointfile { > > unlabeleddata points; > > errorlog error; > > } > > > > type simulationfile { > > file output; > > } > > > > # Apps > > app (file o) cat (file i) > > { > > cat @i stdout=@o; > > } > > > > app (file o) cat2 (file i) > > { > > systeminfo stdout=@o; > > } > > > > app (pointfile o) generatepoints (file c, labeleddata f, string > > mode, int Npoints) > > { > > matlab_callgeneratepoints @c @f mode Npoints @o.points @o.error; > > } > > > > app (simulationfile o) runSimulation(string p,int passes, int > > pindex) > > { > > launchjob passes pindex @o.output; > > } > > > > #Files (using single file mapper) > > file config <"designspace.config">; > > labeleddata labeledpoints <"emptypoints.dat">; > > > > type pointlog; > > > > # Loop > > iterate passes { > > > > # Generate Parameters > > pointfile np ; > > np = generatepoints(config,labeledpoints, "uniform", 50); > > > > errorlog fe = np.error; > > int checkforerror = readData(fe); > > tracef("%s: %i\n", "Generate Parameters Error Value", > > checkforerror); > > > > # Issue Jobs > > simulationfile simfiles[] > > ; > > if(checkforerror==0) { > > unlabeleddata pl = np.points; > > string parameters[] =readData(pl); > > foreach p,pindex in parameters { > > tracef("Launch Job for Parameters: %s\n", p); > > simfiles[pindex] = runSimulation(p,passes,pindex); > > } > > } > > > > # Analyze Jobs > > > > # Generate Prediction > > > > > > > > # creates an array of datafiles named swifttest..out to > > write to > > file out[] > prefix=@strcat("swifttest.",passes,"."),suffix=".out">; > > > > # creates a default of 10 files > > foreach j in [1:@toInt(@arg("n","10"))] { > > file data<"data.txt">; > > out[j] = cat2(data); > > } > > > > # try writing the iteration to a log file > > file passlog <"passes.log">; > > passlog = writeData(passes); > > > > # try reading from another log file > > int readpasses = readData(passlog); > > > > # Write to the Output Log > > tracef("%s: %i\n", "Iteration :", passes); > > tracef("%s: %i\n", "Iteration Read :", readpasses); > > > > #} until (readpasses == 2); # Determine if Done > > } until (passes == 1); # Determine if Done > > > And Here is the error I get: > > EXCEPTION Exception in launchjob: > Arguments: [0, 1, > /scratch/midway/phillicl/SwiftJobs/0.0001.output.job] > Host: pbs > Directory: 
test-20120903-0303-pedfpqu8/jobs/f/launchjob-fff1mjxk > stderr.txt: > stdout.txt: > ---- > > sys:exception @ vdl-int.k, line: 601 > sys:throw @ vdl-int.k, line: 600 > sys:catch @ vdl-int.k, line: 567 > sys:try @ vdl-int.k, line: 469 > task:allocatehost @ vdl-int.k, line: 419 > vdl:execute2 @ execute-default.k, line: 23 > sys:ignoreerrors @ execute-default.k, line: 21 > sys:parallelfor @ execute-default.k, line: 20 > sys:restartonerror @ execute-default.k, line: 16 > sys:sequential @ execute-default.k, line: 14 > sys:try @ execute-default.k, line: 13 > sys:if @ execute-default.k, line: 12 > sys:then @ execute-default.k, line: 11 > sys:if @ execute-default.k, line: 10 > vdl:execute @ test.kml, line: 182 > run_simulation @ test.kml, line: 480 > sys:parallel @ test.kml, line: 465 > foreach @ test.kml, line: 456 > sys:parallel @ test.kml, line: 427 > sys:then @ test.kml, line: 409 > sys:if @ test.kml, line: 404 > sys:sequential @ test.kml, line: 402 > sys:parallel @ test.kml, line: 315 > iterate @ test.kml, line: 229 > vdl:sequentialwithid @ test.kml, line: 226 > vdl:mainp @ test.kml, line: 225 > mainp @ vdl.k, line: 118 > vdl:mains @ test.kml, line: 223 > vdl:mains @ test.kml, line: 223 > rlog:restartlog @ test.kml, line: 222 > kernel:project @ test.kml, line: 2 > test-20120903-0303-pedfpqu8 > Caused by: The following output files were not created by the > application: /scratch/midway/phillicl/SwiftJobs/0.0001.output.job > > Note that > > ls /scratch/midway/phillicl/SwiftJobs/0.0001.output.job > /scratch/midway/phillicl/SwiftJobs/0.0001.output.job > > > On Sep 1, 2012, at 5:45 PM, Mihael Hategan > wrote: > > > The error comes from int checkforerror = readData(np.error); > > > > You have to use the workaround for both. > > > > On Sat, 2012-09-01 at 15:23 -0500, Carolyn Phillips wrote: > >> Sure > >> > >> There are a lot of extra stuff running around in the script, fyi > >> > >> # Types > >> type file; > >> type unlabeleddata; > >> type labeleddata; > >> type errorlog; > >> > >> # Structured Types > >> type pointfile { > >> unlabeleddata points; > >> errorlog error; > >> } > >> > >> type simulationfile { > >> file output; > >> } > >> > >> # Apps > >> app (file o) cat (file i) > >> { > >> cat @i stdout=@o; > >> } > >> > >> app (file o) cat2 (file i) > >> { > >> systeminfo stdout=@o; > >> } > >> > >> app (pointfile o) generatepoints (file c, labeleddata f, string > >> mode, int Npoints) > >> { > >> matlab_callgeneratepoints @c @f mode Npoints @o.points @o.error; > >> } > >> > >> #app (simulationfile o) runSimulation(string p) > >> #{ > >> # launchjob p @o.output; > >> #} > >> > >> #Files (using single file mapper) > >> file config <"designspace.config">; > >> labeleddata labeledpoints <"emptypoints.dat">; > >> > >> type pointlog; > >> > >> # Loop > >> iterate passes { > >> > >> # Generate Parameters > >> pointfile np ; > >> np = generatepoints(config,labeledpoints, "uniform", 50); > >> > >> int checkforerror = readData(np.error); > >> tracef("%s: %i\n", "Generate Parameters Error Value", > >> checkforerror); > >> > >> # Issue Jobs > >> #simulationfile simfiles[] > >> ; > >> if(checkforerror==0) { > >> unlabeleddata pl = np.points; > >> string parameters[] =readData(pl); > >> foreach p,pindex in parameters { > >> tracef("Launch Job for Parameters: %s\n", p); > >> #simfiles[pindex] = runSimulation(p); > >> } > >> } > >> > >> # Analyze Jobs > >> > >> # Generate Prediction > >> > >> > >> > >> # creates an array of datafiles named swifttest..out to > >> write to > >> file out[] >> 
prefix=@strcat("swifttest.",passes,"."),suffix=".out">; > >> > >> # creates a default of 10 files > >> foreach j in [1:@toInt(@arg("n","10"))] { > >> file data<"data.txt">; > >> out[j] = cat2(data); > >> } > >> > >> # try writing the iteration to a log file > >> file passlog <"passes.log">; > >> passlog = writeData(passes); > >> > >> # try reading from another log file > >> int readpasses = readData(passlog); > >> > >> # Write to the Output Log > >> tracef("%s: %i\n", "Iteration :", passes); > >> tracef("%s: %i\n", "Iteration Read :", readpasses); > >> > >> #} until (readpasses == 2); # Determine if Done > >> } until (passes == 1); # Determine if Done > >> > >> > >> On Sep 1, 2012, at 1:57 PM, Mihael Hategan > >> wrote: > >> > >>> Can you post the entire script? > >>> > >>> On Sat, 2012-09-01 at 12:29 -0500, Carolyn Phillips wrote: > >>>> Yes, I tried that > >>>> > >>>> unlabeleddata pl = np.points; > >>>> string parameters[] =readData(pl); > >>>> > >>>> > >>>> and I got > >>>> > >>>> Execution failed: > >>>> mypoints..dat (No such file or directory) > >>>> > >>>> On Aug 31, 2012, at 8:27 PM, Mihael Hategan > >>>> wrote: > >>>> > >>>>> On Fri, 2012-08-31 at 20:11 -0500, Carolyn Phillips wrote: > >>>>>> How would this line work for what I have below? > >>>>>> > >>>>>>>> string parameters[] =readData(np.points); > >>>>>> > >>>>> > >>>>> unlabeleddata tmp = np.points; > >>>>> string parameters[] = readData(tmp); > >>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Aug 31, 2012, at 7:49 PM, Mihael Hategan > >>>>>> wrote: > >>>>>> > >>>>>>> Another bug. > >>>>>>> > >>>>>>> I committed a fix. In the mean time, the solution is: > >>>>>>> > >>>>>>> > >>>>>>> errorlog fe = np.errorlog; > >>>>>>> > >>>>>>> int error = readData(fe); > >>>>>>> > >>>>>>> On Fri, 2012-08-31 at 19:29 -0500, Carolyn Phillips wrote: > >>>>>>>> Hi Mihael, > >>>>>>>> > >>>>>>>> the reason I added the "@" was because > >>>>>>>> > >>>>>>>> now this (similar) line > >>>>>>>> > >>>>>>>> if(checkforerror==0) { > >>>>>>>> string parameters[] =readData(np.points); > >>>>>>>> } > >>>>>>>> > >>>>>>>> gives me this: > >>>>>>>> > >>>>>>>> Execution failed: > >>>>>>>> mypoints..dat (No such file or directory) > >>>>>>>> > >>>>>>>> as in now its not getting the name of the file correct > >>>>>>>> > >>>>>>>> On Aug 31, 2012, at 7:17 PM, Mihael Hategan > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> @np.error means the file name of np.error which is known > >>>>>>>>> statically. So > >>>>>>>>> readData(@np.error) can run as soon as the script starts. > >>>>>>>>> > >>>>>>>>> You probably want to say readData(np.error). > >>>>>>>>> > >>>>>>>>> Mihael > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Fri, 2012-08-31 at 18:55 -0500, Carolyn Phillips wrote: > >>>>>>>>>> So I execute an atomic procedure to generate a datafile, > >>>>>>>>>> and then next > >>>>>>>>>> I want to do something with that data file. However, my > >>>>>>>>>> program is > >>>>>>>>>> trying to do something with the datafile before it has been > >>>>>>>>>> written > >>>>>>>>>> to. So something with order of execution is not working. I > >>>>>>>>>> think the > >>>>>>>>>> problem is that the name of my file exists, but the file > >>>>>>>>>> itself does > >>>>>>>>>> not yet, but execution proceeds anyway! 
> >>>>>>>>>> > >>>>>>>>>> Here are my lines > >>>>>>>>>> > >>>>>>>>>> type pointfile { > >>>>>>>>>> unlabeleddata points; > >>>>>>>>>> errorlog error; > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> # Generate Parameters > >>>>>>>>>> pointfile np > >>>>>>>>>> ; > >>>>>>>>>> np = generatepoints(config,labeledpoints, "uniform", 50); > >>>>>>>>>> > >>>>>>>>>> int checkforerror = readData(@np.error); > >>>>>>>>>> > >>>>>>>>>> This gives an error : > >>>>>>>>>> mypoints.error.dat (No such file or directory) > >>>>>>>>>> > >>>>>>>>>> If I comment out the last line.. all the files show up in > >>>>>>>>>> the directory. (e.g. mypoints.points.dat and > >>>>>>>>>> mypoints.error.dat) ) and if forget to remove the .dat > >>>>>>>>>> files from a prior run, it also runs fine! > >>>>>>>>>> > >>>>>>>>>> How do you fix a problem like that? > >>>>>>>>>> > >>>>>>>>>> _______________________________________________ > >>>>>>>>>> Swift-user mailing list > >>>>>>>>>> Swift-user at ci.uchicago.edu > >>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>>> > >>>> > >>> > >>> > >> > > > > > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Fri Sep 7 14:52:43 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 7 Sep 2012 14:52:43 -0500 (CDT) Subject: [Swift-user] Output Files, ReadData and Order of Execution In-Reply-To: <546410676.309.1346852428453.JavaMail.root@zimbra.anl.gov> Message-ID: <1228936297.5835.1347047563931.JavaMail.root@zimbra.anl.gov> Carolyn, to follow up on your question: below is an example of using the "direct" file access mode. - Mike #----- The swift script $ cat catsndirect.swift type file; app (file o) cat (file i) { cat @i stdout=@o; } file out[]; foreach j in [1:@toint(@arg("n","1"))] { file data<"/tmp/wilde/indir/data.txt">; out[j] = cat(data); } #----- The "cdm" file: $ cat direct rule .* DIRECT / #----- The command line: $ swift -config cf -cdm.file direct -tc.file tc -sites.file sites.xml catsndirect.swift -n=10 #----- The output and input dirs: $ ls -lr /tmp/wilde/{in,out}dir /tmp/wilde/outdir: total 40 -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0010.out -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0009.out -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0008.out -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0007.out -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0006.out -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0005.out -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0004.out -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0003.out -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0002.out -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0001.out /tmp/wilde/indir: total 4 -rw-r--r-- 1 wilde ci-users 8 Sep 7 13:47 data.txt com$ ----- Original Message ----- > From: "Michael Wilde" > To: "Carolyn Phillips" > Cc: swift-user at ci.uchicago.edu > Sent: Wednesday, September 5, 2012 8:40:28 AM > Subject: Re: [Swift-user] Output Files, ReadData and Order of Execution > Hi Carolyn, > > I think this error is due to the fact that the launchjob script is not > coded to match Swift's file management conventions. 
> > Unless you declare that you want to use "Direct" file management, > Swift will expect output files to be created relative to the directory > in which it runs your app() scripts. Thats why it was pulling off the > leading "/". By putting a // at the front of the pathname, you were > inadvertantly causing your launchjob script to place its output file > in a different directory than where Swift was expecting it. > > There's a further mismatch I think between the mapped filename from > simple_mapper (which defaults to 4-digit strings for indices) and the > names that launchjob is trying to create. > > I'll need to send further clarification later, but for now, could you > try the following: > > - go back to using a single leading "/" > - comment out the mkdir and cd in launchjob, as $3 contains the > correct pathname to write to (which will be a long relative pathname > without the leaning "/") > > I think what you really want here is "DIRECT" file management mode, > explained at: > > http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_policy_descriptions > > We need to enhance the User Guide to explain this clearly and fully. > > - Mike > > ----- Original Message ----- > > From: "Carolyn Phillips" > > To: "Mihael Hategan" > > Cc: swift-user at ci.uchicago.edu > > Sent: Sunday, September 2, 2012 10:16:05 PM > > Subject: Re: [Swift-user] Output Files, ReadData and Order of > > Execution > > You are right. That was my problem. (dumb!) > > > > Anyway. My next issue is that Swift is telling me it can't find a > > file > > that exists. Perhaps this is because it does not understand absolute > > directory paths the way I am specifying them? > > > > The short version is that, using a simple mapper, I specify the > > location as ;location="//scratch/midway/phillicl/SwiftJobs/" Note > > that > > I had to put two // at the beginning because the first backslash > > gets > > removed for some reason. Then I pass that file name to a script > > write > > to that file. 
But then Swift doesn't see the file > > > > Here is the more detailed version > > > > I have a script called launch jobs that does the following > > > cd /scratch/midway/phillicl/SwiftJobs > > > > > > mkdir Job.${1}.${2} > > > cd Job.${1}.${2} > > > > > > # Copy in some files and do some work > > > > > > pwd > ${3} > > > > > > > > Here is the swift script > > > > > # Types > > > type file; > > > type unlabeleddata; > > > type labeleddata; > > > type errorlog; > > > > > > # Structured Types > > > type pointfile { > > > unlabeleddata points; > > > errorlog error; > > > } > > > > > > type simulationfile { > > > file output; > > > } > > > > > > # Apps > > > app (file o) cat (file i) > > > { > > > cat @i stdout=@o; > > > } > > > > > > app (file o) cat2 (file i) > > > { > > > systeminfo stdout=@o; > > > } > > > > > > app (pointfile o) generatepoints (file c, labeleddata f, string > > > mode, int Npoints) > > > { > > > matlab_callgeneratepoints @c @f mode Npoints @o.points @o.error; > > > } > > > > > > app (simulationfile o) runSimulation(string p,int passes, int > > > pindex) > > > { > > > launchjob passes pindex @o.output; > > > } > > > > > > #Files (using single file mapper) > > > file config <"designspace.config">; > > > labeleddata labeledpoints <"emptypoints.dat">; > > > > > > type pointlog; > > > > > > # Loop > > > iterate passes { > > > > > > # Generate Parameters > > > pointfile np ; > > > np = generatepoints(config,labeledpoints, "uniform", 50); > > > > > > errorlog fe = np.error; > > > int checkforerror = readData(fe); > > > tracef("%s: %i\n", "Generate Parameters Error Value", > > > checkforerror); > > > > > > # Issue Jobs > > > simulationfile simfiles[] > > > ; > > > if(checkforerror==0) { > > > unlabeleddata pl = np.points; > > > string parameters[] =readData(pl); > > > foreach p,pindex in parameters { > > > tracef("Launch Job for Parameters: %s\n", p); > > > simfiles[pindex] = runSimulation(p,passes,pindex); > > > } > > > } > > > > > > # Analyze Jobs > > > > > > # Generate Prediction > > > > > > > > > > > > # creates an array of datafiles named swifttest..out > > > to > > > write to > > > file out[] > > prefix=@strcat("swifttest.",passes,"."),suffix=".out">; > > > > > > # creates a default of 10 files > > > foreach j in [1:@toInt(@arg("n","10"))] { > > > file data<"data.txt">; > > > out[j] = cat2(data); > > > } > > > > > > # try writing the iteration to a log file > > > file passlog <"passes.log">; > > > passlog = writeData(passes); > > > > > > # try reading from another log file > > > int readpasses = readData(passlog); > > > > > > # Write to the Output Log > > > tracef("%s: %i\n", "Iteration :", passes); > > > tracef("%s: %i\n", "Iteration Read :", readpasses); > > > > > > #} until (readpasses == 2); # Determine if Done > > > } until (passes == 1); # Determine if Done > > > > > > And Here is the error I get: > > > > EXCEPTION Exception in launchjob: > > Arguments: [0, 1, > > /scratch/midway/phillicl/SwiftJobs/0.0001.output.job] > > Host: pbs > > Directory: test-20120903-0303-pedfpqu8/jobs/f/launchjob-fff1mjxk > > stderr.txt: > > stdout.txt: > > ---- > > > > sys:exception @ vdl-int.k, line: 601 > > sys:throw @ vdl-int.k, line: 600 > > sys:catch @ vdl-int.k, line: 567 > > sys:try @ vdl-int.k, line: 469 > > task:allocatehost @ vdl-int.k, line: 419 > > vdl:execute2 @ execute-default.k, line: 23 > > sys:ignoreerrors @ execute-default.k, line: 21 > > sys:parallelfor @ execute-default.k, line: 20 > > sys:restartonerror @ execute-default.k, line: 16 > > sys:sequential @ 
execute-default.k, line: 14 > > sys:try @ execute-default.k, line: 13 > > sys:if @ execute-default.k, line: 12 > > sys:then @ execute-default.k, line: 11 > > sys:if @ execute-default.k, line: 10 > > vdl:execute @ test.kml, line: 182 > > run_simulation @ test.kml, line: 480 > > sys:parallel @ test.kml, line: 465 > > foreach @ test.kml, line: 456 > > sys:parallel @ test.kml, line: 427 > > sys:then @ test.kml, line: 409 > > sys:if @ test.kml, line: 404 > > sys:sequential @ test.kml, line: 402 > > sys:parallel @ test.kml, line: 315 > > iterate @ test.kml, line: 229 > > vdl:sequentialwithid @ test.kml, line: 226 > > vdl:mainp @ test.kml, line: 225 > > mainp @ vdl.k, line: 118 > > vdl:mains @ test.kml, line: 223 > > vdl:mains @ test.kml, line: 223 > > rlog:restartlog @ test.kml, line: 222 > > kernel:project @ test.kml, line: 2 > > test-20120903-0303-pedfpqu8 > > Caused by: The following output files were not created by the > > application: /scratch/midway/phillicl/SwiftJobs/0.0001.output.job > > > > Note that > > > ls /scratch/midway/phillicl/SwiftJobs/0.0001.output.job > > /scratch/midway/phillicl/SwiftJobs/0.0001.output.job > > > > > > On Sep 1, 2012, at 5:45 PM, Mihael Hategan > > wrote: > > > > > The error comes from int checkforerror = readData(np.error); > > > > > > You have to use the workaround for both. > > > > > > On Sat, 2012-09-01 at 15:23 -0500, Carolyn Phillips wrote: > > >> Sure > > >> > > >> There are a lot of extra stuff running around in the script, fyi > > >> > > >> # Types > > >> type file; > > >> type unlabeleddata; > > >> type labeleddata; > > >> type errorlog; > > >> > > >> # Structured Types > > >> type pointfile { > > >> unlabeleddata points; > > >> errorlog error; > > >> } > > >> > > >> type simulationfile { > > >> file output; > > >> } > > >> > > >> # Apps > > >> app (file o) cat (file i) > > >> { > > >> cat @i stdout=@o; > > >> } > > >> > > >> app (file o) cat2 (file i) > > >> { > > >> systeminfo stdout=@o; > > >> } > > >> > > >> app (pointfile o) generatepoints (file c, labeleddata f, string > > >> mode, int Npoints) > > >> { > > >> matlab_callgeneratepoints @c @f mode Npoints @o.points @o.error; > > >> } > > >> > > >> #app (simulationfile o) runSimulation(string p) > > >> #{ > > >> # launchjob p @o.output; > > >> #} > > >> > > >> #Files (using single file mapper) > > >> file config <"designspace.config">; > > >> labeleddata labeledpoints <"emptypoints.dat">; > > >> > > >> type pointlog; > > >> > > >> # Loop > > >> iterate passes { > > >> > > >> # Generate Parameters > > >> pointfile np ; > > >> np = generatepoints(config,labeledpoints, "uniform", 50); > > >> > > >> int checkforerror = readData(np.error); > > >> tracef("%s: %i\n", "Generate Parameters Error Value", > > >> checkforerror); > > >> > > >> # Issue Jobs > > >> #simulationfile simfiles[] > > >> ; > > >> if(checkforerror==0) { > > >> unlabeleddata pl = np.points; > > >> string parameters[] =readData(pl); > > >> foreach p,pindex in parameters { > > >> tracef("Launch Job for Parameters: %s\n", p); > > >> #simfiles[pindex] = runSimulation(p); > > >> } > > >> } > > >> > > >> # Analyze Jobs > > >> > > >> # Generate Prediction > > >> > > >> > > >> > > >> # creates an array of datafiles named swifttest..out > > >> to > > >> write to > > >> file out[] > >> prefix=@strcat("swifttest.",passes,"."),suffix=".out">; > > >> > > >> # creates a default of 10 files > > >> foreach j in [1:@toInt(@arg("n","10"))] { > > >> file data<"data.txt">; > > >> out[j] = cat2(data); > > >> } > > >> > > >> # try writing the iteration to a 
log file > > >> file passlog <"passes.log">; > > >> passlog = writeData(passes); > > >> > > >> # try reading from another log file > > >> int readpasses = readData(passlog); > > >> > > >> # Write to the Output Log > > >> tracef("%s: %i\n", "Iteration :", passes); > > >> tracef("%s: %i\n", "Iteration Read :", readpasses); > > >> > > >> #} until (readpasses == 2); # Determine if Done > > >> } until (passes == 1); # Determine if Done > > >> > > >> > > >> On Sep 1, 2012, at 1:57 PM, Mihael Hategan > > >> wrote: > > >> > > >>> Can you post the entire script? > > >>> > > >>> On Sat, 2012-09-01 at 12:29 -0500, Carolyn Phillips wrote: > > >>>> Yes, I tried that > > >>>> > > >>>> unlabeleddata pl = np.points; > > >>>> string parameters[] =readData(pl); > > >>>> > > >>>> > > >>>> and I got > > >>>> > > >>>> Execution failed: > > >>>> mypoints..dat (No such file or directory) > > >>>> > > >>>> On Aug 31, 2012, at 8:27 PM, Mihael Hategan > > >>>> > > >>>> wrote: > > >>>> > > >>>>> On Fri, 2012-08-31 at 20:11 -0500, Carolyn Phillips wrote: > > >>>>>> How would this line work for what I have below? > > >>>>>> > > >>>>>>>> string parameters[] =readData(np.points); > > >>>>>> > > >>>>> > > >>>>> unlabeleddata tmp = np.points; > > >>>>> string parameters[] = readData(tmp); > > >>>>> > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> On Aug 31, 2012, at 7:49 PM, Mihael Hategan > > >>>>>> wrote: > > >>>>>> > > >>>>>>> Another bug. > > >>>>>>> > > >>>>>>> I committed a fix. In the mean time, the solution is: > > >>>>>>> > > >>>>>>> > > >>>>>>> errorlog fe = np.errorlog; > > >>>>>>> > > >>>>>>> int error = readData(fe); > > >>>>>>> > > >>>>>>> On Fri, 2012-08-31 at 19:29 -0500, Carolyn Phillips wrote: > > >>>>>>>> Hi Mihael, > > >>>>>>>> > > >>>>>>>> the reason I added the "@" was because > > >>>>>>>> > > >>>>>>>> now this (similar) line > > >>>>>>>> > > >>>>>>>> if(checkforerror==0) { > > >>>>>>>> string parameters[] =readData(np.points); > > >>>>>>>> } > > >>>>>>>> > > >>>>>>>> gives me this: > > >>>>>>>> > > >>>>>>>> Execution failed: > > >>>>>>>> mypoints..dat (No such file or directory) > > >>>>>>>> > > >>>>>>>> as in now its not getting the name of the file correct > > >>>>>>>> > > >>>>>>>> On Aug 31, 2012, at 7:17 PM, Mihael Hategan > > >>>>>>>> wrote: > > >>>>>>>> > > >>>>>>>>> @np.error means the file name of np.error which is known > > >>>>>>>>> statically. So > > >>>>>>>>> readData(@np.error) can run as soon as the script starts. > > >>>>>>>>> > > >>>>>>>>> You probably want to say readData(np.error). > > >>>>>>>>> > > >>>>>>>>> Mihael > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> On Fri, 2012-08-31 at 18:55 -0500, Carolyn Phillips wrote: > > >>>>>>>>>> So I execute an atomic procedure to generate a datafile, > > >>>>>>>>>> and then next > > >>>>>>>>>> I want to do something with that data file. However, my > > >>>>>>>>>> program is > > >>>>>>>>>> trying to do something with the datafile before it has > > >>>>>>>>>> been > > >>>>>>>>>> written > > >>>>>>>>>> to. So something with order of execution is not working. > > >>>>>>>>>> I > > >>>>>>>>>> think the > > >>>>>>>>>> problem is that the name of my file exists, but the file > > >>>>>>>>>> itself does > > >>>>>>>>>> not yet, but execution proceeds anyway! 
> > >>>>>>>>>> > > >>>>>>>>>> Here are my lines > > >>>>>>>>>> > > >>>>>>>>>> type pointfile { > > >>>>>>>>>> unlabeleddata points; > > >>>>>>>>>> errorlog error; > > >>>>>>>>>> } > > >>>>>>>>>> > > >>>>>>>>>> # Generate Parameters > > >>>>>>>>>> pointfile np > > >>>>>>>>>> ; > > >>>>>>>>>> np = generatepoints(config,labeledpoints, "uniform", 50); > > >>>>>>>>>> > > >>>>>>>>>> int checkforerror = readData(@np.error); > > >>>>>>>>>> > > >>>>>>>>>> This gives an error : > > >>>>>>>>>> mypoints.error.dat (No such file or directory) > > >>>>>>>>>> > > >>>>>>>>>> If I comment out the last line.. all the files show up in > > >>>>>>>>>> the directory. (e.g. mypoints.points.dat and > > >>>>>>>>>> mypoints.error.dat) ) and if forget to remove the .dat > > >>>>>>>>>> files from a prior run, it also runs fine! > > >>>>>>>>>> > > >>>>>>>>>> How do you fix a problem like that? > > >>>>>>>>>> > > >>>>>>>>>> _______________________________________________ > > >>>>>>>>>> Swift-user mailing list > > >>>>>>>>>> Swift-user at ci.uchicago.edu > > >>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>>> > > >>>> > > >>> > > >>> > > >> > > > > > > > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From iraicu at cs.iit.edu Fri Sep 7 15:48:05 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 07 Sep 2012 15:48:05 -0500 Subject: [Swift-user] Extended Deadline: 5th Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2012 -- at IEEE/ACM Supercomputing 2012 Message-ID: <504A5D85.1000909@cs.iit.edu> Call for Papers --------------------------------------------------------------------------------------- The 5th Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) 2012 http://datasys.cs.iit.edu/events/MTAGS12/ --------------------------------------------------------------------------------------- November 12th, 2012 Salt Lake City, Utah, USA Co-located with with IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC12) ======================================================================================= The 5th workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) will provide the scientific community a dedicated forum for presenting new research, development, and deployment efforts of large-scale many-task computing (MTC) applications on large scale clusters, Grids, Supercomputers, and Cloud Computing infrastructure. MTC, the theme of the workshop encompasses loosely coupled applications, which are generally composed of many tasks (both independent and dependent tasks) to achieve some larger application goal. 
This workshop will cover challenges that can hamper efficiency and utilization in running applications on large-scale systems, such as local resource manager scalability and granularity, efficient utilization of raw hardware, parallel file system contention and scalability, data management, I/O management, reliability at scale, and application scalability. We welcome paper submissions on all theoretical, simulations, and systems topics related to MTC, but we give special consideration to papers addressing petascale to exascale challenges. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the IEEE digital library (pending approval). The workshop will be co-located with the IEEE/ACM Supercomputing 2012 Conference in Salt Lake City Utah on November 12th, 2012. For more information, please see http://datasys.cs.iit.edu/events/MTAGS12/. For more information on past workshops, please see MTAGS11, MTAGS10, MTAGS09, and MTAGS08. We also ran a Special Issue on Many-Task Computing in the IEEE Transactions on Parallel and Distributed Systems (TPDS) which has appeared in June 2011; the proceedings can be found online at http://www.computer.org/portal/web/csdl/abs/trans/td/2011/06/ttd201106toc.htm. We, the workshop organizers, also published a highly relevant paper that defines Many-Task Computing which was published in MTAGS08, titled "Many-Task Computing for Grids and Supercomputers" (http://www.cs.iit.edu/~iraicu/research/publications/2008_MTAGS08_MTC.pdf); we encourage potential authors to read this paper. Topics --------------------------------------------------------------------------------------- We invite the submission of original work that is related to the topics below. The papers can be either short (4 pages) position papers, or long (8 pages) research papers. 
Topics of interest include (in the context of Many-Task Computing): * Compute Resource Management * Scheduling * Job execution frameworks * Local resource manager extensions * Performance evaluation of resource managers in use on large scale systems * Dynamic resource provisioning * Techniques to manage many-core resources and/or GPUs * Challenges and opportunities in running many-task workloads on HPC systems * Challenges and opportunities in running many-task workloads on Cloud Computing infrastructure * Storage architectures and implementations * Distributed file systems * Parallel file systems * Distributed meta-data management * Content distribution systems for large data * Data caching frameworks and techniques * Data management within and across data centers * Data-aware scheduling * Data-intensive computing applications * Eventual-consistency storage usage and management * Programming models and tools * Map-reduce and its generalizations * Many-task computing middleware and applications * Parallel programming frameworks * Ensemble MPI techniques and frameworks * Service-oriented science applications * Large-Scale Workflow Systems * Workflow system performance and scalability analysis * Scalability of workflow systems * Workflow infrastructure and e-Science middleware * Programming Paradigms and Models * Large-Scale Many-Task Applications * High-throughput computing (HTC) applications * Data-intensive applications * Quasi-supercomputing applications, deployments, and experiences * Performance Evaluation * Performance evaluation * Real systems * Simulations * Reliability of large systems * How MTC Addresses Challenges of Petascale and Exascale Computing * Concurency & Programmability * I/O & Memory * Energy * Resilience * Heterogeneity Paper Submission and Publication --------------------------------------------------------------------------------------- Authors are invited to submit papers with unpublished, original work of not more than 8 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages, as per IEEE 8.5 x 11 manuscript guidelines; document templates can be found at http://www.ieee.org/conferences_events/conferences/publishing/templates.html. We are also seeking position papers of no more than 4 pages in length. The final 4/8 page papers in PDF format must be submitted online at https://cmt.research.microsoft.com/MTAGS2012/ before the deadline. Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the IEEE digital library (pending approval). Notifications of the paper decisions will be sent out by October 12th, 2012. Selected excellent work may be eligible for additional post-conference publication as journal articles or book chapters, such as the previous Special Issue on Many-Task Computing in the IEEE Transactions on Parallel and Distributed Systems (TPDS) which has appeared in June 2011. Submission implies the willingness of at least one of the authors to register and present the paper. For more information, please http://datasys.cs.iit.edu/events/MTAGS12/, or send email to mtags12-chairs at datasys.cs.iit.edu. 
Important Dates --------------------------------------------------------------------------------------- * Abstract submission: September 10th, 2012 (11:59PM PST) * Paper submission: September 24th, 2012 (11:59PM PST) * Acceptance notification: October 12th, 2012 * Final papers due: November 7th, 2012 Committee Members --------------------------------------------------------------------------------------- Workshop Chairs (mtags12-chairs at datasys.cs.iit.edu) * Ioan Raicu, Illinois Institute of Technology & Argonne National Laboratory * Ian Foster, University of Chicago & Argonne National Laboratory * Yong Zhao, University of Electronic Science and Technology of China Steering Committee * David Abramson, Monash University, Australia * Jack Dongara, University of Tennessee, USA * Geoffrey Fox, Indiana University, USA * Manish Parashar, Rutgers University, USA * Marc Snir, Argonne National Laboratory & University of Illinois at Urbana Champaign, USA * Xian-He Sun, Illinois Institute of Technology, USA * Weimin Zheng, Tsinghua University, China Publicity Chair (mtags12-publicity at datasys.cs.iit.edu) * Zhao Zhang, University of Chicago, USA Program Committee Chair (mtags12-pc-chair at datasys.cs.iit.edu) * Justin Wozniak, Argonne National Laboratory, USA Technical Committee * Roger Barga, Microsoft Research, USA * Mihai Budiu, Microsoft Research, USA * Kyle Chard, University of Chicago, USA * Yong Chen, Texas Tech University, USA * Evangelinos Constantinos, Massachusetts Institute of Technology, USA * John Dennis, National Center for Atmospheric Research, USA * Catalin Dumitrescu, Fermi National Labs, USA * Dennis Gannon, Microsoft Research, USA * Indranil Gupta, University of Illinois at Urbana Champaign, USA * Florin Isaila, Universidad Carlos III de Madrid, Spain * Kamil Iskra, Argonne National Laboratory, USA * Alexandru Iosup, Delft University of Technology, Netherlands * Hui Jin, Oracle Corporation, USA * Daniel S. Katz, University of Chicago & Argonne National Laboratory, USA * Carl Kesselman, University of Southern California, USA * Zhiling Lan, Illinois Institute of Technology, USA * Mike Lang, Los Alamos National Laboratory, USA * Gregor von Laszewski, Indiana University, USA * Reagan Moore, University of North Carolina, Chappel Hill, USA * Jose Moreira, IBM Research, USA * Chris Moretti, Princeton University, USA * David O'Hallaron, Carnegie Mellon University, Intel Labs, USA * Marlon Pierce, Indiana University, USA * Judy Qiu, Indiana University, USA * Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory, USA * Kui Ren, SUNY Buffalo, USA * Matei Ripeanu, University of British Columbia, Canada * Karen Schuchardt, Pacific Northwest National Laboratory, USA * Wei Tang, Argonne National Laboratory, USA * Valerie Taylor, Texas A&M, USA * Douglas Thain University of Notre Dame, USA * Edward Walker, Whitworth University, USA * Matthew Woitaszek, Occipital, Inc., USA * Ken Yocum, University of California, San Diego, USA * Zhifeng Yun, Louisiana State University, USA * Zhao Zhang, University of Chicago, USA * Ziming Zheng, Illinois Institute of Technology, USA -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= From iraicu at cs.iit.edu Fri Sep 7 17:13:20 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 07 Sep 2012 17:13:20 -0500 Subject: [Swift-user] eScience 2012 Program now available Message-ID: <504A7180.9080600@cs.iit.edu> The Conference Program for the 8th IEEE International Conference on eScience (eScience 2012) is now available at http://www.ci.uchicago.edu/escience2012/. All eScience conference and workshops will be held at the Hyatt Regency Chicago. We have reserved a block of rooms at the hotel with a room rate of $205 per night + tax. Please make your reservations before September 16 at https://resweb.passkey.com/go/ESCIENCE2012 or by phone at +1-800-233-1234 from the US or +1-312-565-1234 from outside the US. Reference "e-Science" when making reservations by phone. The hotel is completely sold out on Sunday night (10/7) due to major events being held in Chicago over the weekend. We have secured a block of rooms at the Hyatt Regency O'Hare for $190 + tax and will provide transportation to the conference hotel on Monday morning. You will be able to make your reservation at the Hyatt O'Hare upon registration for the conference or via email to kristih at ci.uchicago.edu . Advance registration is available until September 28 - see http://www.ci.uchicago.edu/escience2012/registration.php. ACKNOWLEDGEMENTS-- We gratefully acknowledge the support of the following organizations-- Gold Level: Microsoft Research Silver Level: CSIRO Australia Bronze Level: Argonne National Laboratory EMC Indiana University Media: HPCwire Also: The Computation Institute Cray Conference Sponsors: IEEE The IEEE Computer Society -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From iraicu at cs.iit.edu Fri Sep 7 19:11:07 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 07 Sep 2012 19:11:07 -0500 Subject: [Swift-user] Call for Participation: The 9th ICAC 2012 Conference - September 17-21 in San Jose CA Message-ID: <504A8D1B.5000800@cs.iit.edu> ********************************************************************** CALL FOR PARTICIPATION ====================== The 9th ACM International Conference on Autonomic Computing (ICAC 2012) San Jose, California, USA September 17-21, 2012 http://icac2012.cs.fiu.edu Sponsored by ACM ********************************************************************** Online registration is open at http://icac2012.cs.fiu.edu/registration.shtm Reduced fees are available for those registering by September 9, 2012 (extended). This year's technical program features: - 3 distinguished keynote speakers: * Dr. Amin Vahdat, Google/UCSD * Dr. Subutai Ahmad (VP Engineering), Numenta * Dr. Eitan Frachtenberg, Facebook - 24 outstanding technical papers (15 full + 9 short): * covering core and emerging topics such as clouds, virtualization, control, monitoring and diagnosis, and energy * Half of the papers involve authors from industry or government labs - 4 co-located workshops covering hot topics in: * Feedback Computing * Self-Aware Internet of Things * Management of Big Data Systems * Federated Clouds The Conference will be held at the Fairmont Hotel in downtown San Jose, CA, USA. ********************************************************************** IMPORTANT DATES =============== Early registration deadline: September 9, 2012 (extended) Hotel special rate deadline: September 7, 2012 ********************************************************************** CORPORATE SPONSORS ================== Gold Level Partner: IBM Conference partners: VMware, HP, Neustar PhD Student Sponsor: Google Other Sponsor: NEC Labs, Microsoft ********************************************************************** PROGRAM ======= ====================================================================== MONDAY, SEPTEMBER 17, 2012 - WORKSHOPS Feedback Computing 2012 Self-Aware Internet of Things 2012 ====================================================================== TUESDAY, SEPTEMBER 18, 2012 - MAIN CONFERENCE 8:00AM ? 8:45AM Registration 8:45AM ? 9:00AM Welcome Remarks 9:00AM ? 10:00AM Keynote Talk I Symbiosis in Scale Out Networking and Data Management Amin Vahdat, Google and University of California, San Diego 10:00AM ? 10.30AM Break 10:30AM ? 
12:00PM Session: Virtualization Net-Cohort: Detecting and Managing VM Ensembles in Virtualized Data Centers Liting Hu, Karsten Schwan (Georgia Institute of Technology); Ajay Gulati (VMware); Junjie Zhang, Chengwei Wang (Georgia Institute of Technology) Application-aware Cross-layer Virtual Machine Resource Management Lixi Wang, Jing Xu, Ming Zhao (Florida International University) Shifting GEARS to Enable Guest-context Virtual Services Kyle Hale, Lei Xia, Peter Dinda (Northwestern University) 12:00PM-1:30PM Lunch
----------------------------------------------------------------------
1:30PM - 3:30PM Session: Performance and Resource Management When Average is Not Average: Large Response Time Fluctuations in N-tier Systems Qingyang Wang (Georgia Institute of Technology); Yasuhiko Kanemasa, Motoyuki Kawaba (Fujitsu Laboratories Ltd.); Calton Pu (Georgia Institute of Technology) Provisioning Multi-tier Cloud Applications Using Statistical Bounds on Sojourn Time Upendra Sharma, Prashant Shenoy, Don Towsley (University of Massachusetts Amherst) Automated Profiling and Resource Management of Pig Programs for Meeting Service Level Objectives Zhuoyao Zhang (University of Pennsylvania); Ludmila Cherkasova (Hewlett-Packard Labs); Abhishek Verma (University of Illinois at Urbana-Champaign); Boon Thau Loo (University of Pennsylvania) AROMA: Automated Resource Allocation and Configuration of MapReduce Environment in the Cloud Palden Lama, Xiaobo Zhou (University of Colorado at Colorado Springs) 3:30PM-4:00PM Break
4:00PM - 5:15PM Short Papers I Locomotion at Location: When the Rubber hits the Road Gerold Hoelzl, Marc Kurz, Alois Ferscha (Johannes Kepler University Linz, Austria) An Autonomic Resource Provisioning Framework for Mobile Computing Grids Hariharasudhan Viswanathan, Eun Kyung Lee, Ivan Rodero, Dario Pompili (Rutgers University) A Self-Tuning Self-Optimizing Approach for Automated Network Anomaly Detection Systems Dennis Ippoliti, Xiaobo Zhou (University of Colorado at Colorado Springs) Offline and On-Demand Event Correlation for Operations Management of Large Scale IT Systems Chetan Gupta (Hewlett-Packard Labs) PowerTracer: Tracing Requests in Multi-tier Services to Diagnose Energy Inefficiency Gang Lu, Jianfeng Zhan (Institute of Computing Technology, Chinese Academy of Sciences); Haining Wang (College of William and Mary); Lin Yuan (Institute of Computing Technology, Chinese Academy of Sciences); Chuliang Weng (Shanghai Jiao Tong University, China) 6:00PM - 9:00PM Conference Dinner
======================================================================
WEDNESDAY, SEPTEMBER 19 - MAIN CONFERENCE 8:00AM - 9:00AM Registration 9:00AM - 10:00AM Keynote Talk II Automated Machine Learning For Autonomic Computing Subutai Ahmad, VP Engineering, Numenta 10:00AM - 10:30AM Break
10:30AM - 12:00PM Session: Control-Based Approaches Budget-based Control for Interactive Services with Adaptive Execution Yuxiong He, Zihao Ye, Qiang Fu, Sameh Elnikety (Microsoft Research) On the Design of Decentralized Control Architectures for Workload Consolidation in Large-Scale Server Clusters Rui Wang, Nagarajan Kandasamy (Drexel University) Transactional Auto Scaler: Elastic Scaling of In-Memory Transactional Data Grids Diego Didona, Paolo Romano (Instituto Superior Técnico/INESC-ID); Sebastiano Peluso, Francesco Quaglia (Sapienza, Università di Roma) 12:00PM - 1:30PM Lunch
----------------------------------------------------------------------
1:30PM - 2:30PM Session: Energy Adaptive Green Hosting Nan Deng, Christopher Stewart, Jaimie Kelley (The Ohio State University); Daniel Gmach, Martin Arlitt (Hewlett Packard Labs) Dynamic Energy-Aware Capacity Provisioning for Cloud Computing Environments Qi Zhang, Mohamed Faten Zhani (University of Waterloo); Shuo Zhang (National University of Defense Technology); Quanyan Zhu (University of Illinois at Urbana-Champaign); Raouf Boutaba (University of Waterloo); Joseph L. Hellerstein (Google, Inc.) 2:30PM - 3:30PM Short Papers II VESPA: Multi-Layered Self-Protection for Cloud Resources Aurélien Wailly, Marc Lacoste (Orange Labs); Hervé Debar (Télécom SudParis) Usage Patterns in Multi-tenant Data Centers: a Temporal Perspective Robert Birke, Lydia Y. Chen (IBM Research Zurich Lab); Evgenia Smirni (College of William and Mary) Toward Fast Eventual Consistency with Performance Guarantees Feng Yan (College of William and Mary); Alma Riska (EMC Corporation); Evgenia Smirni (College of William and Mary) Optimal Autoscaling in the IaaS Cloud Hamoun Ghanbari, Bradley Simmons, Marin Litoiu, Cornel Barna (York University); Gabriel Iszlai (IBM Toronto) 3:30PM - 4:00PM Break 4:00PM - 6:00PM Poster and Demo Session 6:00PM-9:00PM Conference Outing (tentative)
======================================================================
THURSDAY, SEPTEMBER 20, 2012 - MAIN CONFERENCE 8:00AM - 9:00AM Registration 9:00AM - 10:00AM Keynote Talk III High Efficiency at Web Scale Eitan Frachtenberg, Facebook 10:00AM-10:30AM Break 10:30AM - 12:00PM Session: Diagnosis and Monitoring Chair: TBD 3-Dimensional Root Cause Diagnosis via Co-analysis Ziming Zheng, Li Yu, Zhiling Lan (Illinois Institute of Technology); Terry Jones (Oak Ridge National Laboratory) UBL: Unsupervised Behavior Learning for Predicting Performance Anomalies in Virtualized Cloud Systems Daniel J. Dean, Hiep Nguyen, Xiaohui Gu (North Carolina State University) Evaluating Compressive Sampling Strategies for Performance Monitoring of Data Centers Tingshan Huang, Nagarajan Kandasamy, Harish Sethu (Drexel University) 12:00PM Adjourn
======================================================================
FRIDAY, SEPTEMBER 21, 2012 - WORKSHOPS Management of Big Data Systems 2012 Federated Clouds 2012
**********************************************************************
ORGANIZERS ========== GENERAL CHAIR Dejan Milojicic, HP Labs PROGRAM CHAIRS Dongyan Xu, Purdue University Vanish Talwar, HP Labs INDUSTRY CHAIR Xiaoyun Zhu, VMware WORKSHOPS CHAIR Fred Douglis, EMC POSTERS/DEMO/EXHIBITS CHAIR Eno Thereska, Microsoft Research FINANCE CHAIR Michael Kozuch, Intel LOCAL ARRANGEMENT CHAIR Jessica Blaine PUBLICITY CHAIRS Daniel Batista, University of São Paulo Vartan Padaryan, ISP/Russian Academy of Sci. Ioan Raicu, Illinois Inst. of Technology Jianfeng Zhan, ICT/Chinese Academy of Sci. Ming Zhao, Florida Intl. University PROGRAM COMMITTEE Tarek Abdelzaher, UIUC Umesh Bellur, IIT, Bombay Ken Birman, Cornell University Rajkumar Buyya, Univ.
of Melbourne Rocky Chang, Hong Kong Polytechnic University Yuan Chen, HP Labs Alva Couch, Tufts University Peter Dinda, Northwestern University Fred Douglis, EMC Renato Figueiredo, University of Florida Mohamed Hefeeda, QCRI Joe Hellerstein, Google Geoff Jiang, NEC Labs Jeff Kephart, IBM Research Emre Kiciman, Microsoft Research Fabio Kon, University of São Paulo Mike Kozuch, Intel Labs Dejan Milojicic, HP Labs Klara Nahrstedt, UIUC Priya Narasimhan, CMU Manish Parashar, Rutgers University Ioan Raicu, Illinois Inst. of Technology Omer Rana, Cardiff University Masoud Sadjadi, Florida Intl. University Richard Schlichting, AT&T Labs Hartmut Schmeck, KIT Karsten Schwan, Georgia Tech Onn Shehory, IBM Research Eno Thereska, Microsoft Research Xiaoyun Zhu, VMware **********************************************************************
-- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= =================================================================
From lpesce at uchicago.edu Fri Sep 7 19:50:44 2012 From: lpesce at uchicago.edu (Lorenzo Pesce) Date: Fri, 7 Sep 2012 19:50:44 -0500 Subject: [Swift-user] Question about using ramdisk Message-ID: <240A2648-DF48-429C-959A-D220F9A1B5B7@uchicago.edu>
Dear all, I just realized that I locked myself into a design corner. We use Matlab executables run by a swift wrapper. I wanted to use ramdisk (/dev/shm or whatever works on each node) because our lustre filesystem isn't liking too much to be poked all the time and it is making sure we know that. I planned to move there. I just realized that I have no idea how to tell swift that this is where it should pick up the output files from the run. The easy fix is to simply add a mv statement (or whatever works best, we'll have to try) to the wrapper scripts, since while they will have a different source (the local /dev/shm) they have the same target (/lustre), but I was wondering if Swift has a direct way to handle that. I am asking because the script will be asked to run on different systems where swift will know where it was started and where the wrappers are being run, but the same is not true for the wrappers. Any ideas? Thanks a lot, Lorenzo
From wilde at mcs.anl.gov Sat Sep 8 19:12:43 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 8 Sep 2012 19:12:43 -0500 (CDT) Subject: [Swift-user] Question about using ramdisk In-Reply-To: <240A2648-DF48-429C-959A-D220F9A1B5B7@uchicago.edu> Message-ID: <1450668456.6879.1347149563204.JavaMail.root@zimbra.anl.gov>
Dear Lorenzo, We'll need a bit more information on your data flow pattern to help you with this. In general, the strategy is to move as much IO from shared filesystems to filesystems that are local to each node. If your data flow pattern is one in which you create many independent datasets on the nodes that eventually need to be read by a single app invocation (e.g. one that needs to reduce or summarize all the outputs).
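A minimal sketch (not from the original thread) of the wrapper-level workaround Lorenzo describes above: run the MATLAB step against node-local /dev/shm, then move the result to the output path that Swift mapped for the job. The wrapper name, the assumption that the mapped output path arrives as the last argument, and the result.dat file name are all illustrative, not part of the thread.

#----- Hypothetical wrapper sketch (bash)
#!/bin/bash
set -e
out="${@: -1}"                              # output path Swift expects for this job (assumed to be the last argument)
tmp=$(mktemp -d /dev/shm/swiftjob.XXXXXX)   # node-local scratch (ramdisk)
# Placeholder for the real MATLAB executable; it should write into $tmp
# instead of the shared Lustre filesystem, for example:
#   matlab_app "${@:1:$#-1}" "$tmp/result.dat"
echo "demo output" > "$tmp/result.dat"
mv "$tmp/result.dat" "$out"                 # hand the file back where Swift will look for it
rm -rf "$tmp"

The provider staging and CDM options described next in this reply address the same problem without editing the wrapper.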
Im assuming this is what you meant by "how to tell swift that this is where it should pick up the output files from the run"? One way to do this is to use provider staging, which will send the output files back to the submit host. Another strategy is to use CDM "direct" mode to *efficiently* write output data to a shared filesystem (eg lustre) location A third way - not yet well tested - is to use CDM "gather" mode to collect the output files into tar archives, one archive per node, which will reduce the number of files that need to be written to a shared location. Rather than trying to learn what youre trying to do with a long thread on the swift-user list, lets switch to swift-support or an in-person meeting, and then report the final suggestions back to swift-user and document them in the User Guide. - Mike ----- Original Message ----- > From: "Lorenzo Pesce" > To: swift-user at ci.uchicago.edu > Sent: Friday, September 7, 2012 7:50:44 PM > Subject: [Swift-user] Question about using ramdisk > Dear all, > > I just realized that I locked myself into a design corner. > > We use Matlab executables run by a swift wrapper. > > I wanted to use ramdisk (/dev/shm or whatever works on each node) > because our lustre filesystem isn't liking too much to be poked all > the time and it is making sure we know that. I planned to move there. > I just realized that I have no idea how to tell swift that this is > where it should pick up the output files from the run. > > The easy fix is to simply add a mv statement (or whatever works best, > we'll have to try) to the wrapper scripts, since while they will have > a different source (the local /dev/shm) they have the same target > (/lustre), but I was wondering if Swift has a direct way to handle > that. I am asking because the script will be asked to run on different > systems where swift will know where it was started and where the > wrappers are being run, the same is not true for the wrappers. > > Any ideas? > > Thanks a lot, > > Lorenzo > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From cphillips at mcs.anl.gov Mon Sep 10 13:44:33 2012 From: cphillips at mcs.anl.gov (Carolyn Phillips) Date: Mon, 10 Sep 2012 13:44:33 -0500 Subject: [Swift-user] Output Files, ReadData and Order of Execution In-Reply-To: <1228936297.5835.1347047563931.JavaMail.root@zimbra.anl.gov> References: <1228936297.5835.1347047563931.JavaMail.root@zimbra.anl.gov> Message-ID: Hey Mike. This worked, but I still had to add the extra "/" -Carolyn On Sep 7, 2012, at 2:52 PM, Michael Wilde wrote: > Carolyn, to follow up on your question: below is an example of using the "direct" file access mode. 
> > - Mike > > #----- The swift script > > $ cat catsndirect.swift > > type file; > > app (file o) cat (file i) > { > cat @i stdout=@o; > } > > file out[]; > > foreach j in [1:@toint(@arg("n","1"))] { > file data<"/tmp/wilde/indir/data.txt">; > out[j] = cat(data); > } > > #----- The "cdm" file: > > $ cat direct > rule .* DIRECT / > > #----- The command line: > > $ swift -config cf -cdm.file direct -tc.file tc -sites.file sites.xml catsndirect.swift -n=10 > > #----- The output and input dirs: > > $ ls -lr /tmp/wilde/{in,out}dir > > /tmp/wilde/outdir: > > total 40 > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0010.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0009.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0008.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0007.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0006.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0005.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0004.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0003.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0002.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0001.out > > /tmp/wilde/indir: > > total 4 > -rw-r--r-- 1 wilde ci-users 8 Sep 7 13:47 data.txt > com$ > > > ----- Original Message ----- >> From: "Michael Wilde" >> To: "Carolyn Phillips" >> Cc: swift-user at ci.uchicago.edu >> Sent: Wednesday, September 5, 2012 8:40:28 AM >> Subject: Re: [Swift-user] Output Files, ReadData and Order of Execution >> Hi Carolyn, >> >> I think this error is due to the fact that the launchjob script is not >> coded to match Swift's file management conventions. >> >> Unless you declare that you want to use "Direct" file management, >> Swift will expect output files to be created relative to the directory >> in which it runs your app() scripts. Thats why it was pulling off the >> leading "/". By putting a // at the front of the pathname, you were >> inadvertantly causing your launchjob script to place its output file >> in a different directory than where Swift was expecting it. >> >> There's a further mismatch I think between the mapped filename from >> simple_mapper (which defaults to 4-digit strings for indices) and the >> names that launchjob is trying to create. >> >> I'll need to send further clarification later, but for now, could you >> try the following: >> >> - go back to using a single leading "/" >> - comment out the mkdir and cd in launchjob, as $3 contains the >> correct pathname to write to (which will be a long relative pathname >> without the leaning "/") >> >> I think what you really want here is "DIRECT" file management mode, >> explained at: >> >> http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_policy_descriptions >> >> We need to enhance the User Guide to explain this clearly and fully. >> >> - Mike >> >> ----- Original Message ----- >>> From: "Carolyn Phillips" >>> To: "Mihael Hategan" >>> Cc: swift-user at ci.uchicago.edu >>> Sent: Sunday, September 2, 2012 10:16:05 PM >>> Subject: Re: [Swift-user] Output Files, ReadData and Order of >>> Execution >>> You are right. That was my problem. (dumb!) >>> >>> Anyway. My next issue is that Swift is telling me it can't find a >>> file >>> that exists. Perhaps this is because it does not understand absolute >>> directory paths the way I am specifying them? 
>>> >>> The short version is that, using a simple mapper, I specify the >>> location as ;location="//scratch/midway/phillicl/SwiftJobs/" Note >>> that >>> I had to put two // at the beginning because the first backslash >>> gets >>> removed for some reason. Then I pass that file name to a script >>> write >>> to that file. But then Swift doesn't see the file >>> >>> Here is the more detailed version >>> >>> I have a script called launch jobs that does the following >>>> cd /scratch/midway/phillicl/SwiftJobs >>>> >>>> mkdir Job.${1}.${2} >>>> cd Job.${1}.${2} >>>> >>>> # Copy in some files and do some work >>>> >>>> pwd > ${3} >>> >>> >>> >>> Here is the swift script >>> >>>> # Types >>>> type file; >>>> type unlabeleddata; >>>> type labeleddata; >>>> type errorlog; >>>> >>>> # Structured Types >>>> type pointfile { >>>> unlabeleddata points; >>>> errorlog error; >>>> } >>>> >>>> type simulationfile { >>>> file output; >>>> } >>>> >>>> # Apps >>>> app (file o) cat (file i) >>>> { >>>> cat @i stdout=@o; >>>> } >>>> >>>> app (file o) cat2 (file i) >>>> { >>>> systeminfo stdout=@o; >>>> } >>>> >>>> app (pointfile o) generatepoints (file c, labeleddata f, string >>>> mode, int Npoints) >>>> { >>>> matlab_callgeneratepoints @c @f mode Npoints @o.points @o.error; >>>> } >>>> >>>> app (simulationfile o) runSimulation(string p,int passes, int >>>> pindex) >>>> { >>>> launchjob passes pindex @o.output; >>>> } >>>> >>>> #Files (using single file mapper) >>>> file config <"designspace.config">; >>>> labeleddata labeledpoints <"emptypoints.dat">; >>>> >>>> type pointlog; >>>> >>>> # Loop >>>> iterate passes { >>>> >>>> # Generate Parameters >>>> pointfile np ; >>>> np = generatepoints(config,labeledpoints, "uniform", 50); >>>> >>>> errorlog fe = np.error; >>>> int checkforerror = readData(fe); >>>> tracef("%s: %i\n", "Generate Parameters Error Value", >>>> checkforerror); >>>> >>>> # Issue Jobs >>>> simulationfile simfiles[] >>>> ; >>>> if(checkforerror==0) { >>>> unlabeleddata pl = np.points; >>>> string parameters[] =readData(pl); >>>> foreach p,pindex in parameters { >>>> tracef("Launch Job for Parameters: %s\n", p); >>>> simfiles[pindex] = runSimulation(p,passes,pindex); >>>> } >>>> } >>>> >>>> # Analyze Jobs >>>> >>>> # Generate Prediction >>>> >>>> >>>> >>>> # creates an array of datafiles named swifttest..out >>>> to >>>> write to >>>> file out[]>>> prefix=@strcat("swifttest.",passes,"."),suffix=".out">; >>>> >>>> # creates a default of 10 files >>>> foreach j in [1:@toInt(@arg("n","10"))] { >>>> file data<"data.txt">; >>>> out[j] = cat2(data); >>>> } >>>> >>>> # try writing the iteration to a log file >>>> file passlog <"passes.log">; >>>> passlog = writeData(passes); >>>> >>>> # try reading from another log file >>>> int readpasses = readData(passlog); >>>> >>>> # Write to the Output Log >>>> tracef("%s: %i\n", "Iteration :", passes); >>>> tracef("%s: %i\n", "Iteration Read :", readpasses); >>>> >>>> #} until (readpasses == 2); # Determine if Done >>>> } until (passes == 1); # Determine if Done >>> >>> >>> And Here is the error I get: >>> >>> EXCEPTION Exception in launchjob: >>> Arguments: [0, 1, >>> /scratch/midway/phillicl/SwiftJobs/0.0001.output.job] >>> Host: pbs >>> Directory: test-20120903-0303-pedfpqu8/jobs/f/launchjob-fff1mjxk >>> stderr.txt: >>> stdout.txt: >>> ---- >>> >>> sys:exception @ vdl-int.k, line: 601 >>> sys:throw @ vdl-int.k, line: 600 >>> sys:catch @ vdl-int.k, line: 567 >>> sys:try @ vdl-int.k, line: 469 >>> task:allocatehost @ vdl-int.k, line: 419 >>> vdl:execute2 @ 
execute-default.k, line: 23 >>> sys:ignoreerrors @ execute-default.k, line: 21 >>> sys:parallelfor @ execute-default.k, line: 20 >>> sys:restartonerror @ execute-default.k, line: 16 >>> sys:sequential @ execute-default.k, line: 14 >>> sys:try @ execute-default.k, line: 13 >>> sys:if @ execute-default.k, line: 12 >>> sys:then @ execute-default.k, line: 11 >>> sys:if @ execute-default.k, line: 10 >>> vdl:execute @ test.kml, line: 182 >>> run_simulation @ test.kml, line: 480 >>> sys:parallel @ test.kml, line: 465 >>> foreach @ test.kml, line: 456 >>> sys:parallel @ test.kml, line: 427 >>> sys:then @ test.kml, line: 409 >>> sys:if @ test.kml, line: 404 >>> sys:sequential @ test.kml, line: 402 >>> sys:parallel @ test.kml, line: 315 >>> iterate @ test.kml, line: 229 >>> vdl:sequentialwithid @ test.kml, line: 226 >>> vdl:mainp @ test.kml, line: 225 >>> mainp @ vdl.k, line: 118 >>> vdl:mains @ test.kml, line: 223 >>> vdl:mains @ test.kml, line: 223 >>> rlog:restartlog @ test.kml, line: 222 >>> kernel:project @ test.kml, line: 2 >>> test-20120903-0303-pedfpqu8 >>> Caused by: The following output files were not created by the >>> application: /scratch/midway/phillicl/SwiftJobs/0.0001.output.job >>> >>> Note that >>>> ls /scratch/midway/phillicl/SwiftJobs/0.0001.output.job >>> /scratch/midway/phillicl/SwiftJobs/0.0001.output.job >>> >>> >>> On Sep 1, 2012, at 5:45 PM, Mihael Hategan >>> wrote: >>> >>>> The error comes from int checkforerror = readData(np.error); >>>> >>>> You have to use the workaround for both. >>>> >>>> On Sat, 2012-09-01 at 15:23 -0500, Carolyn Phillips wrote: >>>>> Sure >>>>> >>>>> There are a lot of extra stuff running around in the script, fyi >>>>> >>>>> # Types >>>>> type file; >>>>> type unlabeleddata; >>>>> type labeleddata; >>>>> type errorlog; >>>>> >>>>> # Structured Types >>>>> type pointfile { >>>>> unlabeleddata points; >>>>> errorlog error; >>>>> } >>>>> >>>>> type simulationfile { >>>>> file output; >>>>> } >>>>> >>>>> # Apps >>>>> app (file o) cat (file i) >>>>> { >>>>> cat @i stdout=@o; >>>>> } >>>>> >>>>> app (file o) cat2 (file i) >>>>> { >>>>> systeminfo stdout=@o; >>>>> } >>>>> >>>>> app (pointfile o) generatepoints (file c, labeleddata f, string >>>>> mode, int Npoints) >>>>> { >>>>> matlab_callgeneratepoints @c @f mode Npoints @o.points @o.error; >>>>> } >>>>> >>>>> #app (simulationfile o) runSimulation(string p) >>>>> #{ >>>>> # launchjob p @o.output; >>>>> #} >>>>> >>>>> #Files (using single file mapper) >>>>> file config <"designspace.config">; >>>>> labeleddata labeledpoints <"emptypoints.dat">; >>>>> >>>>> type pointlog; >>>>> >>>>> # Loop >>>>> iterate passes { >>>>> >>>>> # Generate Parameters >>>>> pointfile np ; >>>>> np = generatepoints(config,labeledpoints, "uniform", 50); >>>>> >>>>> int checkforerror = readData(np.error); >>>>> tracef("%s: %i\n", "Generate Parameters Error Value", >>>>> checkforerror); >>>>> >>>>> # Issue Jobs >>>>> #simulationfile simfiles[] >>>>> ; >>>>> if(checkforerror==0) { >>>>> unlabeleddata pl = np.points; >>>>> string parameters[] =readData(pl); >>>>> foreach p,pindex in parameters { >>>>> tracef("Launch Job for Parameters: %s\n", p); >>>>> #simfiles[pindex] = runSimulation(p); >>>>> } >>>>> } >>>>> >>>>> # Analyze Jobs >>>>> >>>>> # Generate Prediction >>>>> >>>>> >>>>> >>>>> # creates an array of datafiles named swifttest..out >>>>> to >>>>> write to >>>>> file out[]>>>> prefix=@strcat("swifttest.",passes,"."),suffix=".out">; >>>>> >>>>> # creates a default of 10 files >>>>> foreach j in [1:@toInt(@arg("n","10"))] { 
>>>>> file data<"data.txt">; >>>>> out[j] = cat2(data); >>>>> } >>>>> >>>>> # try writing the iteration to a log file >>>>> file passlog <"passes.log">; >>>>> passlog = writeData(passes); >>>>> >>>>> # try reading from another log file >>>>> int readpasses = readData(passlog); >>>>> >>>>> # Write to the Output Log >>>>> tracef("%s: %i\n", "Iteration :", passes); >>>>> tracef("%s: %i\n", "Iteration Read :", readpasses); >>>>> >>>>> #} until (readpasses == 2); # Determine if Done >>>>> } until (passes == 1); # Determine if Done >>>>> >>>>> >>>>> On Sep 1, 2012, at 1:57 PM, Mihael Hategan >>>>> wrote: >>>>> >>>>>> Can you post the entire script? >>>>>> >>>>>> On Sat, 2012-09-01 at 12:29 -0500, Carolyn Phillips wrote: >>>>>>> Yes, I tried that >>>>>>> >>>>>>> unlabeleddata pl = np.points; >>>>>>> string parameters[] =readData(pl); >>>>>>> >>>>>>> >>>>>>> and I got >>>>>>> >>>>>>> Execution failed: >>>>>>> mypoints..dat (No such file or directory) >>>>>>> >>>>>>> On Aug 31, 2012, at 8:27 PM, Mihael Hategan >>>>>>> >>>>>>> wrote: >>>>>>> >>>>>>>> On Fri, 2012-08-31 at 20:11 -0500, Carolyn Phillips wrote: >>>>>>>>> How would this line work for what I have below? >>>>>>>>> >>>>>>>>>>> string parameters[] =readData(np.points); >>>>>>>>> >>>>>>>> >>>>>>>> unlabeleddata tmp = np.points; >>>>>>>> string parameters[] = readData(tmp); >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Aug 31, 2012, at 7:49 PM, Mihael Hategan >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Another bug. >>>>>>>>>> >>>>>>>>>> I committed a fix. In the mean time, the solution is: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> errorlog fe = np.errorlog; >>>>>>>>>> >>>>>>>>>> int error = readData(fe); >>>>>>>>>> >>>>>>>>>> On Fri, 2012-08-31 at 19:29 -0500, Carolyn Phillips wrote: >>>>>>>>>>> Hi Mihael, >>>>>>>>>>> >>>>>>>>>>> the reason I added the "@" was because >>>>>>>>>>> >>>>>>>>>>> now this (similar) line >>>>>>>>>>> >>>>>>>>>>> if(checkforerror==0) { >>>>>>>>>>> string parameters[] =readData(np.points); >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> gives me this: >>>>>>>>>>> >>>>>>>>>>> Execution failed: >>>>>>>>>>> mypoints..dat (No such file or directory) >>>>>>>>>>> >>>>>>>>>>> as in now its not getting the name of the file correct >>>>>>>>>>> >>>>>>>>>>> On Aug 31, 2012, at 7:17 PM, Mihael Hategan >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> @np.error means the file name of np.error which is known >>>>>>>>>>>> statically. So >>>>>>>>>>>> readData(@np.error) can run as soon as the script starts. >>>>>>>>>>>> >>>>>>>>>>>> You probably want to say readData(np.error). >>>>>>>>>>>> >>>>>>>>>>>> Mihael >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Fri, 2012-08-31 at 18:55 -0500, Carolyn Phillips wrote: >>>>>>>>>>>>> So I execute an atomic procedure to generate a datafile, >>>>>>>>>>>>> and then next >>>>>>>>>>>>> I want to do something with that data file. However, my >>>>>>>>>>>>> program is >>>>>>>>>>>>> trying to do something with the datafile before it has >>>>>>>>>>>>> been >>>>>>>>>>>>> written >>>>>>>>>>>>> to. So something with order of execution is not working. >>>>>>>>>>>>> I >>>>>>>>>>>>> think the >>>>>>>>>>>>> problem is that the name of my file exists, but the file >>>>>>>>>>>>> itself does >>>>>>>>>>>>> not yet, but execution proceeds anyway! 
>>>>>>>>>>>>> >>>>>>>>>>>>> Here are my lines >>>>>>>>>>>>> >>>>>>>>>>>>> type pointfile { >>>>>>>>>>>>> unlabeleddata points; >>>>>>>>>>>>> errorlog error; >>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> # Generate Parameters >>>>>>>>>>>>> pointfile np >>>>>>>>>>>>> ; >>>>>>>>>>>>> np = generatepoints(config,labeledpoints, "uniform", 50); >>>>>>>>>>>>> >>>>>>>>>>>>> int checkforerror = readData(@np.error); >>>>>>>>>>>>> >>>>>>>>>>>>> This gives an error : >>>>>>>>>>>>> mypoints.error.dat (No such file or directory) >>>>>>>>>>>>> >>>>>>>>>>>>> If I comment out the last line.. all the files show up in >>>>>>>>>>>>> the directory. (e.g. mypoints.points.dat and >>>>>>>>>>>>> mypoints.error.dat) ) and if forget to remove the .dat >>>>>>>>>>>>> files from a prior run, it also runs fine! >>>>>>>>>>>>> >>>>>>>>>>>>> How do you fix a problem like that? >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> Swift-user mailing list >>>>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>> >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From cphillips at mcs.anl.gov Mon Sep 10 13:46:13 2012 From: cphillips at mcs.anl.gov (Carolyn Phillips) Date: Mon, 10 Sep 2012 13:46:13 -0500 Subject: [Swift-user] Output Files, ReadData and Order of Execution In-Reply-To: <1228936297.5835.1347047563931.JavaMail.root@zimbra.anl.gov> References: <1228936297.5835.1347047563931.JavaMail.root@zimbra.anl.gov> Message-ID: <1ED59F12-EE02-478A-908D-4F8C3B2F7F51@mcs.anl.gov> Hey Mike. This worked, but I still had to add the extra "/" -Carolyn On Sep 7, 2012, at 2:52 PM, Michael Wilde wrote: > Carolyn, to follow up on your question: below is an example of using the "direct" file access mode. 
> > - Mike > > #----- The swift script > > $ cat catsndirect.swift > > type file; > > app (file o) cat (file i) > { > cat @i stdout=@o; > } > > file out[]; > > foreach j in [1:@toint(@arg("n","1"))] { > file data<"/tmp/wilde/indir/data.txt">; > out[j] = cat(data); > } > > #----- The "cdm" file: > > $ cat direct > rule .* DIRECT / > > #----- The command line: > > $ swift -config cf -cdm.file direct -tc.file tc -sites.file sites.xml catsndirect.swift -n=10 > > #----- The output and input dirs: > > $ ls -lr /tmp/wilde/{in,out}dir > > /tmp/wilde/outdir: > > total 40 > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0010.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0009.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0008.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0007.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0006.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0005.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0004.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0003.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0002.out > -rw-r--r-- 1 wilde ci-users 8 Sep 7 14:11 f.0001.out > > /tmp/wilde/indir: > > total 4 > -rw-r--r-- 1 wilde ci-users 8 Sep 7 13:47 data.txt > com$ > > > ----- Original Message ----- >> From: "Michael Wilde" >> To: "Carolyn Phillips" >> Cc: swift-user at ci.uchicago.edu >> Sent: Wednesday, September 5, 2012 8:40:28 AM >> Subject: Re: [Swift-user] Output Files, ReadData and Order of Execution >> Hi Carolyn, >> >> I think this error is due to the fact that the launchjob script is not >> coded to match Swift's file management conventions. >> >> Unless you declare that you want to use "Direct" file management, >> Swift will expect output files to be created relative to the directory >> in which it runs your app() scripts. Thats why it was pulling off the >> leading "/". By putting a // at the front of the pathname, you were >> inadvertantly causing your launchjob script to place its output file >> in a different directory than where Swift was expecting it. >> >> There's a further mismatch I think between the mapped filename from >> simple_mapper (which defaults to 4-digit strings for indices) and the >> names that launchjob is trying to create. >> >> I'll need to send further clarification later, but for now, could you >> try the following: >> >> - go back to using a single leading "/" >> - comment out the mkdir and cd in launchjob, as $3 contains the >> correct pathname to write to (which will be a long relative pathname >> without the leaning "/") >> >> I think what you really want here is "DIRECT" file management mode, >> explained at: >> >> http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_policy_descriptions >> >> We need to enhance the User Guide to explain this clearly and fully. >> >> - Mike >> >> ----- Original Message ----- >>> From: "Carolyn Phillips" >>> To: "Mihael Hategan" >>> Cc: swift-user at ci.uchicago.edu >>> Sent: Sunday, September 2, 2012 10:16:05 PM >>> Subject: Re: [Swift-user] Output Files, ReadData and Order of >>> Execution >>> You are right. That was my problem. (dumb!) >>> >>> Anyway. My next issue is that Swift is telling me it can't find a >>> file >>> that exists. Perhaps this is because it does not understand absolute >>> directory paths the way I am specifying them? 
>>> >>> The short version is that, using a simple mapper, I specify the >>> location as ;location="//scratch/midway/phillicl/SwiftJobs/" Note >>> that >>> I had to put two // at the beginning because the first backslash >>> gets >>> removed for some reason. Then I pass that file name to a script >>> write >>> to that file. But then Swift doesn't see the file >>> >>> Here is the more detailed version >>> >>> I have a script called launch jobs that does the following >>>> cd /scratch/midway/phillicl/SwiftJobs >>>> >>>> mkdir Job.${1}.${2} >>>> cd Job.${1}.${2} >>>> >>>> # Copy in some files and do some work >>>> >>>> pwd > ${3} >>> >>> >>> >>> Here is the swift script >>> >>>> # Types >>>> type file; >>>> type unlabeleddata; >>>> type labeleddata; >>>> type errorlog; >>>> >>>> # Structured Types >>>> type pointfile { >>>> unlabeleddata points; >>>> errorlog error; >>>> } >>>> >>>> type simulationfile { >>>> file output; >>>> } >>>> >>>> # Apps >>>> app (file o) cat (file i) >>>> { >>>> cat @i stdout=@o; >>>> } >>>> >>>> app (file o) cat2 (file i) >>>> { >>>> systeminfo stdout=@o; >>>> } >>>> >>>> app (pointfile o) generatepoints (file c, labeleddata f, string >>>> mode, int Npoints) >>>> { >>>> matlab_callgeneratepoints @c @f mode Npoints @o.points @o.error; >>>> } >>>> >>>> app (simulationfile o) runSimulation(string p,int passes, int >>>> pindex) >>>> { >>>> launchjob passes pindex @o.output; >>>> } >>>> >>>> #Files (using single file mapper) >>>> file config <"designspace.config">; >>>> labeleddata labeledpoints <"emptypoints.dat">; >>>> >>>> type pointlog; >>>> >>>> # Loop >>>> iterate passes { >>>> >>>> # Generate Parameters >>>> pointfile np ; >>>> np = generatepoints(config,labeledpoints, "uniform", 50); >>>> >>>> errorlog fe = np.error; >>>> int checkforerror = readData(fe); >>>> tracef("%s: %i\n", "Generate Parameters Error Value", >>>> checkforerror); >>>> >>>> # Issue Jobs >>>> simulationfile simfiles[] >>>> ; >>>> if(checkforerror==0) { >>>> unlabeleddata pl = np.points; >>>> string parameters[] =readData(pl); >>>> foreach p,pindex in parameters { >>>> tracef("Launch Job for Parameters: %s\n", p); >>>> simfiles[pindex] = runSimulation(p,passes,pindex); >>>> } >>>> } >>>> >>>> # Analyze Jobs >>>> >>>> # Generate Prediction >>>> >>>> >>>> >>>> # creates an array of datafiles named swifttest..out >>>> to >>>> write to >>>> file out[]>>> prefix=@strcat("swifttest.",passes,"."),suffix=".out">; >>>> >>>> # creates a default of 10 files >>>> foreach j in [1:@toInt(@arg("n","10"))] { >>>> file data<"data.txt">; >>>> out[j] = cat2(data); >>>> } >>>> >>>> # try writing the iteration to a log file >>>> file passlog <"passes.log">; >>>> passlog = writeData(passes); >>>> >>>> # try reading from another log file >>>> int readpasses = readData(passlog); >>>> >>>> # Write to the Output Log >>>> tracef("%s: %i\n", "Iteration :", passes); >>>> tracef("%s: %i\n", "Iteration Read :", readpasses); >>>> >>>> #} until (readpasses == 2); # Determine if Done >>>> } until (passes == 1); # Determine if Done >>> >>> >>> And Here is the error I get: >>> >>> EXCEPTION Exception in launchjob: >>> Arguments: [0, 1, >>> /scratch/midway/phillicl/SwiftJobs/0.0001.output.job] >>> Host: pbs >>> Directory: test-20120903-0303-pedfpqu8/jobs/f/launchjob-fff1mjxk >>> stderr.txt: >>> stdout.txt: >>> ---- >>> >>> sys:exception @ vdl-int.k, line: 601 >>> sys:throw @ vdl-int.k, line: 600 >>> sys:catch @ vdl-int.k, line: 567 >>> sys:try @ vdl-int.k, line: 469 >>> task:allocatehost @ vdl-int.k, line: 419 >>> vdl:execute2 @ 
execute-default.k, line: 23 >>> sys:ignoreerrors @ execute-default.k, line: 21 >>> sys:parallelfor @ execute-default.k, line: 20 >>> sys:restartonerror @ execute-default.k, line: 16 >>> sys:sequential @ execute-default.k, line: 14 >>> sys:try @ execute-default.k, line: 13 >>> sys:if @ execute-default.k, line: 12 >>> sys:then @ execute-default.k, line: 11 >>> sys:if @ execute-default.k, line: 10 >>> vdl:execute @ test.kml, line: 182 >>> run_simulation @ test.kml, line: 480 >>> sys:parallel @ test.kml, line: 465 >>> foreach @ test.kml, line: 456 >>> sys:parallel @ test.kml, line: 427 >>> sys:then @ test.kml, line: 409 >>> sys:if @ test.kml, line: 404 >>> sys:sequential @ test.kml, line: 402 >>> sys:parallel @ test.kml, line: 315 >>> iterate @ test.kml, line: 229 >>> vdl:sequentialwithid @ test.kml, line: 226 >>> vdl:mainp @ test.kml, line: 225 >>> mainp @ vdl.k, line: 118 >>> vdl:mains @ test.kml, line: 223 >>> vdl:mains @ test.kml, line: 223 >>> rlog:restartlog @ test.kml, line: 222 >>> kernel:project @ test.kml, line: 2 >>> test-20120903-0303-pedfpqu8 >>> Caused by: The following output files were not created by the >>> application: /scratch/midway/phillicl/SwiftJobs/0.0001.output.job >>> >>> Note that >>>> ls /scratch/midway/phillicl/SwiftJobs/0.0001.output.job >>> /scratch/midway/phillicl/SwiftJobs/0.0001.output.job >>> >>> >>> On Sep 1, 2012, at 5:45 PM, Mihael Hategan >>> wrote: >>> >>>> The error comes from int checkforerror = readData(np.error); >>>> >>>> You have to use the workaround for both. >>>> >>>> On Sat, 2012-09-01 at 15:23 -0500, Carolyn Phillips wrote: >>>>> Sure >>>>> >>>>> There are a lot of extra stuff running around in the script, fyi >>>>> >>>>> # Types >>>>> type file; >>>>> type unlabeleddata; >>>>> type labeleddata; >>>>> type errorlog; >>>>> >>>>> # Structured Types >>>>> type pointfile { >>>>> unlabeleddata points; >>>>> errorlog error; >>>>> } >>>>> >>>>> type simulationfile { >>>>> file output; >>>>> } >>>>> >>>>> # Apps >>>>> app (file o) cat (file i) >>>>> { >>>>> cat @i stdout=@o; >>>>> } >>>>> >>>>> app (file o) cat2 (file i) >>>>> { >>>>> systeminfo stdout=@o; >>>>> } >>>>> >>>>> app (pointfile o) generatepoints (file c, labeleddata f, string >>>>> mode, int Npoints) >>>>> { >>>>> matlab_callgeneratepoints @c @f mode Npoints @o.points @o.error; >>>>> } >>>>> >>>>> #app (simulationfile o) runSimulation(string p) >>>>> #{ >>>>> # launchjob p @o.output; >>>>> #} >>>>> >>>>> #Files (using single file mapper) >>>>> file config <"designspace.config">; >>>>> labeleddata labeledpoints <"emptypoints.dat">; >>>>> >>>>> type pointlog; >>>>> >>>>> # Loop >>>>> iterate passes { >>>>> >>>>> # Generate Parameters >>>>> pointfile np ; >>>>> np = generatepoints(config,labeledpoints, "uniform", 50); >>>>> >>>>> int checkforerror = readData(np.error); >>>>> tracef("%s: %i\n", "Generate Parameters Error Value", >>>>> checkforerror); >>>>> >>>>> # Issue Jobs >>>>> #simulationfile simfiles[] >>>>> ; >>>>> if(checkforerror==0) { >>>>> unlabeleddata pl = np.points; >>>>> string parameters[] =readData(pl); >>>>> foreach p,pindex in parameters { >>>>> tracef("Launch Job for Parameters: %s\n", p); >>>>> #simfiles[pindex] = runSimulation(p); >>>>> } >>>>> } >>>>> >>>>> # Analyze Jobs >>>>> >>>>> # Generate Prediction >>>>> >>>>> >>>>> >>>>> # creates an array of datafiles named swifttest..out >>>>> to >>>>> write to >>>>> file out[]>>>> prefix=@strcat("swifttest.",passes,"."),suffix=".out">; >>>>> >>>>> # creates a default of 10 files >>>>> foreach j in [1:@toInt(@arg("n","10"))] { 
>>>>> file data<"data.txt">; >>>>> out[j] = cat2(data); >>>>> } >>>>> >>>>> # try writing the iteration to a log file >>>>> file passlog <"passes.log">; >>>>> passlog = writeData(passes); >>>>> >>>>> # try reading from another log file >>>>> int readpasses = readData(passlog); >>>>> >>>>> # Write to the Output Log >>>>> tracef("%s: %i\n", "Iteration :", passes); >>>>> tracef("%s: %i\n", "Iteration Read :", readpasses); >>>>> >>>>> #} until (readpasses == 2); # Determine if Done >>>>> } until (passes == 1); # Determine if Done >>>>> >>>>> >>>>> On Sep 1, 2012, at 1:57 PM, Mihael Hategan >>>>> wrote: >>>>> >>>>>> Can you post the entire script? >>>>>> >>>>>> On Sat, 2012-09-01 at 12:29 -0500, Carolyn Phillips wrote: >>>>>>> Yes, I tried that >>>>>>> >>>>>>> unlabeleddata pl = np.points; >>>>>>> string parameters[] =readData(pl); >>>>>>> >>>>>>> >>>>>>> and I got >>>>>>> >>>>>>> Execution failed: >>>>>>> mypoints..dat (No such file or directory) >>>>>>> >>>>>>> On Aug 31, 2012, at 8:27 PM, Mihael Hategan >>>>>>> >>>>>>> wrote: >>>>>>> >>>>>>>> On Fri, 2012-08-31 at 20:11 -0500, Carolyn Phillips wrote: >>>>>>>>> How would this line work for what I have below? >>>>>>>>> >>>>>>>>>>> string parameters[] =readData(np.points); >>>>>>>>> >>>>>>>> >>>>>>>> unlabeleddata tmp = np.points; >>>>>>>> string parameters[] = readData(tmp); >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Aug 31, 2012, at 7:49 PM, Mihael Hategan >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Another bug. >>>>>>>>>> >>>>>>>>>> I committed a fix. In the mean time, the solution is: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> errorlog fe = np.errorlog; >>>>>>>>>> >>>>>>>>>> int error = readData(fe); >>>>>>>>>> >>>>>>>>>> On Fri, 2012-08-31 at 19:29 -0500, Carolyn Phillips wrote: >>>>>>>>>>> Hi Mihael, >>>>>>>>>>> >>>>>>>>>>> the reason I added the "@" was because >>>>>>>>>>> >>>>>>>>>>> now this (similar) line >>>>>>>>>>> >>>>>>>>>>> if(checkforerror==0) { >>>>>>>>>>> string parameters[] =readData(np.points); >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> gives me this: >>>>>>>>>>> >>>>>>>>>>> Execution failed: >>>>>>>>>>> mypoints..dat (No such file or directory) >>>>>>>>>>> >>>>>>>>>>> as in now its not getting the name of the file correct >>>>>>>>>>> >>>>>>>>>>> On Aug 31, 2012, at 7:17 PM, Mihael Hategan >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> @np.error means the file name of np.error which is known >>>>>>>>>>>> statically. So >>>>>>>>>>>> readData(@np.error) can run as soon as the script starts. >>>>>>>>>>>> >>>>>>>>>>>> You probably want to say readData(np.error). >>>>>>>>>>>> >>>>>>>>>>>> Mihael >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Fri, 2012-08-31 at 18:55 -0500, Carolyn Phillips wrote: >>>>>>>>>>>>> So I execute an atomic procedure to generate a datafile, >>>>>>>>>>>>> and then next >>>>>>>>>>>>> I want to do something with that data file. However, my >>>>>>>>>>>>> program is >>>>>>>>>>>>> trying to do something with the datafile before it has >>>>>>>>>>>>> been >>>>>>>>>>>>> written >>>>>>>>>>>>> to. So something with order of execution is not working. >>>>>>>>>>>>> I >>>>>>>>>>>>> think the >>>>>>>>>>>>> problem is that the name of my file exists, but the file >>>>>>>>>>>>> itself does >>>>>>>>>>>>> not yet, but execution proceeds anyway! 
>>>>>>>>>>>>> >>>>>>>>>>>>> Here are my lines >>>>>>>>>>>>> >>>>>>>>>>>>> type pointfile { >>>>>>>>>>>>> unlabeleddata points; >>>>>>>>>>>>> errorlog error; >>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> # Generate Parameters >>>>>>>>>>>>> pointfile np >>>>>>>>>>>>> ; >>>>>>>>>>>>> np = generatepoints(config,labeledpoints, "uniform", 50); >>>>>>>>>>>>> >>>>>>>>>>>>> int checkforerror = readData(@np.error); >>>>>>>>>>>>> >>>>>>>>>>>>> This gives an error : >>>>>>>>>>>>> mypoints.error.dat (No such file or directory) >>>>>>>>>>>>> >>>>>>>>>>>>> If I comment out the last line.. all the files show up in >>>>>>>>>>>>> the directory. (e.g. mypoints.points.dat and >>>>>>>>>>>>> mypoints.error.dat) ) and if forget to remove the .dat >>>>>>>>>>>>> files from a prior run, it also runs fine! >>>>>>>>>>>>> >>>>>>>>>>>>> How do you fix a problem like that? >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> Swift-user mailing list >>>>>>>>>>>>> Swift-user at ci.uchicago.edu >>>>>>>>>>>>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>> >>> _______________________________________________ >>> Swift-user mailing list >>> Swift-user at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >> >> -- >> Michael Wilde >> Computation Institute, University of Chicago >> Mathematics and Computer Science Division >> Argonne National Laboratory >> >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > From robinweiss at uchicago.edu Mon Sep 10 16:56:45 2012 From: robinweiss at uchicago.edu (Robin Weiss) Date: Mon, 10 Sep 2012 21:56:45 +0000 Subject: [Swift-user] Custom/External Mappers Message-ID: <97B8C98AF6CD2146ADC0FEB4066CD50602A96535@xm-mbx-04-prod.ad.uchicago.edu> Howdy swifters, I am having some issues getting external mappers to work. In particular, swift appears to hang when you use a custom mapper that takes in one or more command line arguments. Attached is a tar ball with a basic example of the issue. You can uncomment each line in the mapper.swift script to see the behavior. I have included 4 versions of my external mapper, two work and two do not (see mapper.swift). Also included are the config, sites (using localhost), and tc files I'm using. I noticed this issue after moving from version 0.93 to trunk (Swift trunk swift-r5917 cog-r3463) when some scripts I had been using started to hang. Thanks in advance, Robin -- Robin M. Weiss Research Programmer Research Computing Center The University of Chicago 6030 S. Ellis Ave., Suite 289C Chicago, IL 60637 robinweiss at uchicago.edu 773.702.9030 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
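For readers without the attachment, the general shape of an external mapper declaration is sketched below. This is a generic example, not the contents of Robin's customMapper.tar; the script name and the "n" parameter are placeholders. The relevant detail is that every key=value pair after exec is handed to the mapper program on its command line, which is exactly the case reported to hang.

type file;

# Swift runs the program named by exec (here, roughly "mapper.sh -n 5")
# and maps whatever the program prints on stdout, one "[index] path"
# pair per line, for example:
#   [0] data/out.0000.txt
#   [1] data/out.0001.txt
# "mapper.sh" and the extra parameter "n" are placeholders for this sketch.
file out[] <ext; exec="mapper.sh", n="5">;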
Name: customMapper.tar Type: application/x-tar Size: 20480 bytes Desc: customMapper.tar URL: From jmargolpeople at gmail.com Wed Sep 12 10:50:35 2012 From: jmargolpeople at gmail.com (Jonathan Margoliash) Date: Wed, 12 Sep 2012 11:50:35 -0400 Subject: [Swift-user] Getting swift to run on Fusion Message-ID: Hello swift support, This is my first attempt getting swift to work on Fusion, and I'm getting the following output to the terminal: ------ Warning: Function toint is deprecated, at line 10 Swift trunk swift-r5882 cog-r3434 RunID: 20120912-1032-5y7xb1ug Progress: time: Wed, 12 Sep 2012 10:32:51 -0500 Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34 Submitted:8 Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34 Submitted:8 Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34 Submitted:8 ... Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34 Submitted:8 Failed to shut down block: Block 0912-321051-000005 (8x60.000s) org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Failed to cancel task. qdel returned with an exit code of 153 at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) at org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) at org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) at org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34 Submitted:8 Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34 Submitted:8 Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34 Submitted:8 ... Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34 Submitted:8 Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34 Submitted:8 Failed to shut down block: Block 0912-321051-000006 (8x60.000s) org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: Failed to cancel task. 
qdel returned with an exit code of 153 at org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) at org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) at org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) at org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) at org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) at org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) at org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) at org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) at org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) at org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) at org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) at org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) at org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34 Submitted:8 Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34 Submitted:8 Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34 Submitted:8 ... ------ I understand the long lines of unchanging "Progress: ..." reports - the shared queue is busy, and so I am not expecting my job to be executed right away. However, I don't understand why I'm getting these "failed to cancel task" errors. I gave each individual app well more than enough time for it to run to completion. And while I set the timelimit on the entire process to be much smaller than it needs (60 in sites.xml, when the process could run for days) I presumed the entire process would just get shut down after 60 seconds of runtime. Why is this cropping up? Thanks, Jonathan -------------- next part -------------- An HTML attachment was scrubbed... URL: From davidk at ci.uchicago.edu Wed Sep 12 11:20:39 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 12 Sep 2012 11:20:39 -0500 (CDT) Subject: [Swift-user] Getting swift to run on Fusion In-Reply-To: Message-ID: <1881454633.31705.1347466839487.JavaMail.root@zimbra-mb2.anl.gov> Jonathan, Could you please provide a pointer to the log file that got created from this run? Thanks, David ----- Original Message ----- > From: "Jonathan Margoliash" > To: swift-user at ci.uchicago.edu, "Swift Language" , "Professor E. 
Yan" > Sent: Wednesday, September 12, 2012 10:50:35 AM > Subject: Getting swift to run on Fusion > Hello swift support, > > > This is my first attempt getting swift to work on Fusion, and I'm > getting the following output to the terminal: > > > ------ > > > > Warning: Function toint is deprecated, at line 10 > Swift trunk swift-r5882 cog-r3434 > > > RunID: 20120912-1032-5y7xb1ug > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500 > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34 > Submitted:8 > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34 > Submitted:8 > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34 > Submitted:8 > ... > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34 > Submitted:8 > Failed to shut down block: Block 0912-321051-000005 (8x60.000s) > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Failed to cancel task. qdel returned with an exit code of 153 > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) > at > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) > at > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) > at > org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) > at > org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) > at > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) > at > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) > at > org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) > at > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) > at > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34 > Submitted:8 > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34 > Submitted:8 > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34 > Submitted:8 > ... > > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34 > Submitted:8 > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34 > Submitted:8 > Failed to shut down block: Block 0912-321051-000006 (8x60.000s) > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > Failed to cancel task. 
qdel returned with an exit code of 153 > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) > at > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > at > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) > at > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) > at > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) > at > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) > at > org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) > at > org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) > at > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) > at > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) > at > org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) > at > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) > at > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) > at > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34 > Submitted:8 > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34 > Submitted:8 > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34 > Submitted:8 > ... > > > ------ > > > I understand the long lines of unchanging "Progress: ..." reports - > the shared queue is busy, and so I am not expecting my job to be > executed right away. However, I don't understand why I'm getting these > "failed to cancel task" errors. I gave each individual app well more > than enough time for it to run to completion. And while I set the > timelimit on the entire process to be much smaller than it needs > (60 in sites.xml, > when the process could run for days) > I presumed the entire process would just get shut down after 60 > seconds of runtime. Why is this cropping up? Thanks, > > > Jonathan From jmargolpeople at gmail.com Wed Sep 12 11:32:56 2012 From: jmargolpeople at gmail.com (Jonathan Margoliash) Date: Wed, 12 Sep 2012 12:32:56 -0400 Subject: [Swift-user] Getting swift to run on Fusion In-Reply-To: <1881454633.31705.1347466839487.JavaMail.root@zimbra-mb2.anl.gov> References: <1881454633.31705.1347466839487.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: I attached the .0.rlog, .log, .d and the swift.log files. Which of those files do you use for debugging? And these files are all located in the directory /home/jmargoliash/my_SwiftSCE2_branch_matlab/runs/run-20120912-103235 on Fusion, if that's what you were asking for. Thanks! Jonathan On Wed, Sep 12, 2012 at 12:20 PM, David Kelly wrote: > Jonathan, > > Could you please provide a pointer to the log file that got created from > this run? 
> > Thanks, > David > > ----- Original Message ----- > > From: "Jonathan Margoliash" > > To: swift-user at ci.uchicago.edu, "Swift Language" , > "Professor E. Yan" > > Sent: Wednesday, September 12, 2012 10:50:35 AM > > Subject: Getting swift to run on Fusion > > Hello swift support, > > > > > > This is my first attempt getting swift to work on Fusion, and I'm > > getting the following output to the terminal: > > > > > > ------ > > > > > > > > Warning: Function toint is deprecated, at line 10 > > Swift trunk swift-r5882 cog-r3434 > > > > > > RunID: 20120912-1032-5y7xb1ug > > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500 > > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34 > > Submitted:8 > > ... > > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34 > > Submitted:8 > > Failed to shut down block: Block 0912-321051-000005 (8x60.000s) > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Failed to cancel task. qdel returned with an exit code of 153 > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > > at > > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) > > at > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) > > at > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) > > at > > > org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) > > at > > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) > > at > > > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) > > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34 > > Submitted:8 > > ... 
> > > > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34 > > Submitted:8 > > Failed to shut down block: Block 0912-321051-000006 (8x60.000s) > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Failed to cancel task. qdel returned with an exit code of 153 > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > > at > > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) > > at > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) > > at > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) > > at > > > org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) > > at > > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) > > at > > > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) > > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34 > > Submitted:8 > > ... > > > > > > ------ > > > > > > I understand the long lines of unchanging "Progress: ..." reports - > > the shared queue is busy, and so I am not expecting my job to be > > executed right away. However, I don't understand why I'm getting these > > "failed to cancel task" errors. I gave each individual app well more > > than enough time for it to run to completion. And while I set the > > timelimit on the entire process to be much smaller than it needs > > (60 in sites.xml, > > when the process could run for days) > > I presumed the entire process would just get shut down after 60 > > seconds of runtime. Why is this cropping up? Thanks, > > > > > > Jonathan > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: logfiles.tar Type: application/x-tar Size: 1361920 bytes Desc: not available URL: From davidk at ci.uchicago.edu Wed Sep 12 15:24:17 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Wed, 12 Sep 2012 15:24:17 -0500 (CDT) Subject: [Swift-user] Getting swift to run on Fusion In-Reply-To: Message-ID: <872332234.33473.1347481457916.JavaMail.root@zimbra-mb2.anl.gov> Jonathan, I think the error is related to something being misconfigured in sites.xml. When I tried the same version of sites.xml, I saw the same qdel error. I will look to see why that is, but in the meantime, can you please try using this sites.xml: 3600 8 shared 100 1 2 5.99 10000 100 100 /homes/davidk/my_SwiftSCE2_branch_matlab/runs/run-20120912-150613/swiftwork (Modify your workdirectory as needed). I added this to my copy of fusion_start_sce.sh, in /homes/davidk/my_SwiftSCE2_branch_matlab/fusion_start_sce.sh. It seems to work for me, at least in terms of submitting and reporting on the status of jobs. The jobs themselves fail because there are references to /usr/bin/octave in some scripts which doesn't exist on Fusion. Hopefully this should help you get a little further. Thanks, David ----- Original Message ----- > From: "Jonathan Margoliash" > To: "David Kelly" > Cc: swift-user at ci.uchicago.edu, "Professor E. Yan" > Sent: Wednesday, September 12, 2012 11:32:56 AM > Subject: Re: Getting swift to run on Fusion > I attached the .0.rlog, .log, .d and the swift.log files. Which of > those files do you use for debugging? And these files are all located > in the directory > > > /home/jmargoliash/my_SwiftSCE2_branch_matlab/runs/run-20120912-103235 > > > on Fusion, if that's what you were asking for. Thanks! > > > Jonathan > > > On Wed, Sep 12, 2012 at 12:20 PM, David Kelly < davidk at ci.uchicago.edu > > wrote: > > > Jonathan, > > Could you please provide a pointer to the log file that got created > from this run? > > Thanks, > David > > > > ----- Original Message ----- > > From: "Jonathan Margoliash" < jmargolpeople at gmail.com > > > To: swift-user at ci.uchicago.edu , "Swift Language" < > > davidk at ci.uchicago.edu >, "Professor E. Yan" < eyan at anl.gov > > > Sent: Wednesday, September 12, 2012 10:50:35 AM > > Subject: Getting swift to run on Fusion > > Hello swift support, > > > > > > This is my first attempt getting swift to work on Fusion, and I'm > > getting the following output to the terminal: > > > > > > ------ > > > > > > > > Warning: Function toint is deprecated, at line 10 > > Swift trunk swift-r5882 cog-r3434 > > > > > > RunID: 20120912-1032-5y7xb1ug > > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500 > > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34 > > Submitted:8 > > ... > > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34 > > Submitted:8 > > Failed to shut down block: Block 0912-321051-000005 (8x60.000s) > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Failed to cancel task. 
qdel returned with an exit code of 153 > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > > at > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) > > at > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) > > at > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) > > at > > org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) > > at > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) > > at > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) > > at > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) > > at > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) > > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34 > > Submitted:8 > > ... > > > > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34 > > Submitted:8 > > Failed to shut down block: Block 0912-321051-000006 (8x60.000s) > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > Failed to cancel task. 
qdel returned with an exit code of 153 > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) > > at > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > > at > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) > > at > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) > > at > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) > > at > > org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) > > at > > org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) > > at > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) > > at > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) > > at > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) > > at > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) > > at > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) > > at > > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) > > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34 > > Submitted:8 > > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34 > > Submitted:8 > > ... > > > > > > ------ > > > > > > I understand the long lines of unchanging "Progress: ..." reports - > > the shared queue is busy, and so I am not expecting my job to be > > executed right away. However, I don't understand why I'm getting > > these > > "failed to cancel task" errors. I gave each individual app well more > > than enough time for it to run to completion. And while I set the > > timelimit on the entire process to be much smaller than it needs > > (60 in > > sites.xml, > > when the process could run for days) > > I presumed the entire process would just get shut down after 60 > > seconds of runtime. Why is this cropping up? Thanks, > > > > > > Jonathan From jmargolpeople at gmail.com Wed Sep 12 15:44:10 2012 From: jmargolpeople at gmail.com (Jonathan Margoliash) Date: Wed, 12 Sep 2012 15:44:10 -0500 Subject: [Swift-user] Getting swift to run on Fusion In-Reply-To: <872332234.33473.1347481457916.JavaMail.root@zimbra-mb2.anl.gov> References: <872332234.33473.1347481457916.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: Thanks David! I was expecting the jobs themselves to crash, but I wanted to get to that point in the debugging process. I'll try this script for now, and we'll figure out why the other one wasn't working. Thanks again, Jonathan On Wed, Sep 12, 2012 at 3:24 PM, David Kelly wrote: > Jonathan, > > I think the error is related to something being misconfigured in > sites.xml. When I tried the same version of sites.xml, I saw the same qdel > error. 
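On the exit code itself (a likely reading, not something stated in the thread): 153 from a Torque/PBS qdel commonly corresponds to PBSE_UNKJOBID, "Unknown Job Id", seen through an 8-bit exit status. That would mean the 60-second blocks had already ended by the time the coaster service tried to cancel them. The arithmetic:

  $ echo $(( 15001 % 256 ))    # PBSE_UNKJOBID is error 15001; only the low 8 bits survive
  153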
I will look to see why that is, but in the meantime, can you please > try using this sites.xml: > > > > > > 3600 > 8 > shared > 100 > 1 > 2 > 5.99 > 10000 > 100 > 100 > > /homes/davidk/my_SwiftSCE2_branch_matlab/runs/run-20120912-150613/swiftwork > > > > (Modify your workdirectory as needed). I added this to my copy of > fusion_start_sce.sh, > in /homes/davidk/my_SwiftSCE2_branch_matlab/fusion_start_sce.sh. It seems > to work for me, at least in terms of submitting and reporting on the status > of jobs. The jobs themselves fail because there are references to > /usr/bin/octave in some scripts which doesn't exist on Fusion. Hopefully > this should help you get a little further. > > Thanks, > David > > ----- Original Message ----- > > From: "Jonathan Margoliash" > > To: "David Kelly" > > Cc: swift-user at ci.uchicago.edu, "Professor E. Yan" > > Sent: Wednesday, September 12, 2012 11:32:56 AM > > Subject: Re: Getting swift to run on Fusion > > I attached the .0.rlog, .log, .d and the swift.log files. Which of > > those files do you use for debugging? And these files are all located > > in the directory > > > > > > /home/jmargoliash/my_SwiftSCE2_branch_matlab/runs/run-20120912-103235 > > > > > > on Fusion, if that's what you were asking for. Thanks! > > > > > > Jonathan > > > > > > On Wed, Sep 12, 2012 at 12:20 PM, David Kelly < davidk at ci.uchicago.edu > > > wrote: > > > > > > Jonathan, > > > > Could you please provide a pointer to the log file that got created > > from this run? > > > > Thanks, > > David > > > > > > > > ----- Original Message ----- > > > From: "Jonathan Margoliash" < jmargolpeople at gmail.com > > > > To: swift-user at ci.uchicago.edu , "Swift Language" < > > > davidk at ci.uchicago.edu >, "Professor E. Yan" < eyan at anl.gov > > > > Sent: Wednesday, September 12, 2012 10:50:35 AM > > > Subject: Getting swift to run on Fusion > > > Hello swift support, > > > > > > > > > This is my first attempt getting swift to work on Fusion, and I'm > > > getting the following output to the terminal: > > > > > > > > > ------ > > > > > > > > > > > > Warning: Function toint is deprecated, at line 10 > > > Swift trunk swift-r5882 cog-r3434 > > > > > > > > > RunID: 20120912-1032-5y7xb1ug > > > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500 > > > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34 > > > Submitted:8 > > > ... > > > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34 > > > Submitted:8 > > > Failed to shut down block: Block 0912-321051-000005 (8x60.000s) > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > Failed to cancel task. 
qdel returned with an exit code of 153 > > > at > > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) > > > at > > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > > > at > > > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) > > > at > > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) > > > at > > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) > > > at > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) > > > at > > > > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) > > > at > > > > org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) > > > at > > > > org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) > > > at > > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) > > > at > > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) > > > at > > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) > > > at > > > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) > > > at > > > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) > > > at > > > > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) > > > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34 > > > Submitted:8 > > > ... > > > > > > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34 > > > Submitted:8 > > > Failed to shut down block: Block 0912-321051-000006 (8x60.000s) > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > Failed to cancel task. 
qdel returned with an exit code of 153 > > > at > > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) > > > at > > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > > > at > > > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) > > > at > > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) > > > at > > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) > > > at > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) > > > at > > > > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) > > > at > > > > org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) > > > at > > > > org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) > > > at > > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) > > > at > > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) > > > at > > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) > > > at > > > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) > > > at > > > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) > > > at > > > > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) > > > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34 > > > Submitted:8 > > > ... > > > > > > > > > ------ > > > > > > > > > I understand the long lines of unchanging "Progress: ..." reports - > > > the shared queue is busy, and so I am not expecting my job to be > > > executed right away. However, I don't understand why I'm getting > > > these > > > "failed to cancel task" errors. I gave each individual app well more > > > than enough time for it to run to completion. And while I set the > > > timelimit on the entire process to be much smaller than it needs > > > (60 in > > > sites.xml, > > > when the process could run for days) > > > I presumed the entire process would just get shut down after 60 > > > seconds of runtime. Why is this cropping up? Thanks, > > > > > > > > > Jonathan > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmargolpeople at gmail.com Wed Sep 12 16:46:20 2012 From: jmargolpeople at gmail.com (Jonathan Margoliash) Date: Wed, 12 Sep 2012 16:46:20 -0500 Subject: [Swift-user] Getting swift to run on Fusion In-Reply-To: References: <872332234.33473.1347481457916.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: So, I am using your sites.xml file for a slightly newer version of the code (located at home/jmargoliash/my_SwiftSCE2_branch as opposed to home/jmargoliash/my_SwiftSCE2_branch_matlab ). 
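For reference, the list archive strips the XML element names from the sites.xml David posted, leaving only the values. A plausible reconstruction, assuming the usual coaster profile keys for those values and a local:pbs coaster provider (both assumptions, since only the values survive), is:

<config>
  <pool handle="fusion">
    <!-- pool handle, provider, and jobmanager are assumptions; only the profile values survive in the archive -->
    <execution provider="coaster" jobmanager="local:pbs"/>
    <profile namespace="globus" key="maxtime">3600</profile>
    <profile namespace="globus" key="jobsPerNode">8</profile>
    <profile namespace="globus" key="queue">shared</profile>
    <profile namespace="globus" key="slots">100</profile>
    <profile namespace="globus" key="nodeGranularity">1</profile>
    <profile namespace="globus" key="maxNodes">2</profile>
    <profile namespace="karajan" key="jobThrottle">5.99</profile>
    <profile namespace="karajan" key="initialScore">10000</profile>
    <profile namespace="globus" key="lowOverallocation">100</profile>
    <profile namespace="globus" key="highOverallocation">100</profile>
    <workdirectory>/homes/davidk/my_SwiftSCE2_branch_matlab/runs/run-20120912-150613/swiftwork</workdirectory>
  </pool>
</config>

Read this way, maxtime is the lifetime of a coaster block (the PBS job that hosts workers) in seconds, so 3600 gives hour-long blocks where the earlier 60-second setting made blocks expire almost immediately, and jobsPerNode allows up to eight concurrent app invocations per node, which matters for the MATLAB memory question that follows.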
When I run this version, I'm getting the following error out of matlab: ---- stderr.txt: Fatal Error on startup: Unable to start the JVM. Error occurred during initialization of VM Could not reserve enough space for object heap There is not enough memory to start up the Java virtual machine. Try quitting other applications or increasing your virtual memory. ----- My question is: is this error matlab's fault, or is this sites.xml file trying to run too many apps on each node at once? Also, a tangentially related questions about the coaster service: Why would you want to have more than one coaster worker running on each node? Or rather, does each coaster worker correspond to a single app invocation, or can one coaster worker manage many simultaneous app invocations on a single node? On Wed, Sep 12, 2012 at 3:44 PM, Jonathan Margoliash < jmargolpeople at gmail.com> wrote: > Thanks David! > > I was expecting the jobs themselves to crash, but I wanted to get to that > point in the debugging process. I'll try this script for now, and we'll > figure out why the other one wasn't working. Thanks again, > > Jonathan > > > On Wed, Sep 12, 2012 at 3:24 PM, David Kelly wrote: > >> Jonathan, >> >> I think the error is related to something being misconfigured in >> sites.xml. When I tried the same version of sites.xml, I saw the same qdel >> error. I will look to see why that is, but in the meantime, can you please >> try using this sites.xml: >> >> >> >> >> >> 3600 >> 8 >> shared >> 100 >> 1 >> 2 >> 5.99 >> 10000 >> 100 >> 100 >> >> /homes/davidk/my_SwiftSCE2_branch_matlab/runs/run-20120912-150613/swiftwork >> >> >> >> (Modify your workdirectory as needed). I added this to my copy of >> fusion_start_sce.sh, >> in /homes/davidk/my_SwiftSCE2_branch_matlab/fusion_start_sce.sh. It seems >> to work for me, at least in terms of submitting and reporting on the status >> of jobs. The jobs themselves fail because there are references to >> /usr/bin/octave in some scripts which doesn't exist on Fusion. Hopefully >> this should help you get a little further. >> >> Thanks, >> David >> >> ----- Original Message ----- >> > From: "Jonathan Margoliash" >> > To: "David Kelly" >> > Cc: swift-user at ci.uchicago.edu, "Professor E. Yan" >> > Sent: Wednesday, September 12, 2012 11:32:56 AM >> > Subject: Re: Getting swift to run on Fusion >> > I attached the .0.rlog, .log, .d and the swift.log files. Which of >> > those files do you use for debugging? And these files are all located >> > in the directory >> > >> > >> > /home/jmargoliash/my_SwiftSCE2_branch_matlab/runs/run-20120912-103235 >> > >> > >> > on Fusion, if that's what you were asking for. Thanks! >> > >> > >> > Jonathan >> > >> > >> > On Wed, Sep 12, 2012 at 12:20 PM, David Kelly < davidk at ci.uchicago.edu >> > > wrote: >> > >> > >> > Jonathan, >> > >> > Could you please provide a pointer to the log file that got created >> > from this run? >> > >> > Thanks, >> > David >> > >> > >> > >> > ----- Original Message ----- >> > > From: "Jonathan Margoliash" < jmargolpeople at gmail.com > >> > > To: swift-user at ci.uchicago.edu , "Swift Language" < >> > > davidk at ci.uchicago.edu >, "Professor E. 
Yan" < eyan at anl.gov > >> > > Sent: Wednesday, September 12, 2012 10:50:35 AM >> > > Subject: Getting swift to run on Fusion >> > > Hello swift support, >> > > >> > > >> > > This is my first attempt getting swift to work on Fusion, and I'm >> > > getting the following output to the terminal: >> > > >> > > >> > > ------ >> > > >> > > >> > > >> > > Warning: Function toint is deprecated, at line 10 >> > > Swift trunk swift-r5882 cog-r3434 >> > > >> > > >> > > RunID: 20120912-1032-5y7xb1ug >> > > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500 >> > > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34 >> > > Submitted:8 >> > > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34 >> > > Submitted:8 >> > > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34 >> > > Submitted:8 >> > > ... >> > > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34 >> > > Submitted:8 >> > > Failed to shut down block: Block 0912-321051-000005 (8x60.000s) >> > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >> > > Failed to cancel task. qdel returned with an exit code of 153 >> > > at >> > > >> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) >> > > at >> > > >> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) >> > > at >> > > >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) >> > > at >> > > >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) >> > > at >> > > >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) >> > > at >> > > >> org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) >> > > at >> > > >> org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) >> > > at >> > > >> org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) >> > > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34 >> > > Submitted:8 >> > > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34 >> > > Submitted:8 >> > > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34 >> > > Submitted:8 >> > > ... 
>> > > >> > > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34 >> > > Submitted:8 >> > > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34 >> > > Submitted:8 >> > > Failed to shut down block: Block 0912-321051-000006 (8x60.000s) >> > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: >> > > Failed to cancel task. qdel returned with an exit code of 153 >> > > at >> > > >> org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) >> > > at >> > > >> org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) >> > > at >> > > >> org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) >> > > at >> > > >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) >> > > at >> > > >> org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) >> > > at >> > > >> org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) >> > > at >> > > >> org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) >> > > at >> > > >> org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) >> > > at >> > > >> org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) >> > > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34 >> > > Submitted:8 >> > > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34 >> > > Submitted:8 >> > > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34 >> > > Submitted:8 >> > > ... >> > > >> > > >> > > ------ >> > > >> > > >> > > I understand the long lines of unchanging "Progress: ..." reports - >> > > the shared queue is busy, and so I am not expecting my job to be >> > > executed right away. However, I don't understand why I'm getting >> > > these >> > > "failed to cancel task" errors. I gave each individual app well more >> > > than enough time for it to run to completion. And while I set the >> > > timelimit on the entire process to be much smaller than it needs >> > > (60 in >> > > sites.xml, >> > > when the process could run for days) >> > > I presumed the entire process would just get shut down after 60 >> > > seconds of runtime. Why is this cropping up? Thanks, >> > > >> > > >> > > Jonathan >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
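On the JVM startup failure quoted above: MATLAB tries to allocate a Java heap at startup, so several copies running on the same node can exhaust its memory even when one copy runs fine. If the wrapper does not need MATLAB's Java layer, starting MATLAB without the JVM sidesteps the allocation entirely; a generic invocation (the actual run_swat_wrapper.sh may differ, this is only an illustration) would be:

  matlab -nodisplay -nosplash -nojvm -r "run_swat_wrapper; exit"

Lowering jobsPerNode in sites.xml is the other obvious knob, and the Octave route David mentions avoids the JVM question altogether.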
URL: From davidk at ci.uchicago.edu Thu Sep 13 10:01:00 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 13 Sep 2012 10:01:00 -0500 (CDT) Subject: [Swift-user] Getting swift to run on Fusion In-Reply-To: Message-ID: <227424889.36340.1347548460809.JavaMail.root@zimbra-mb2.anl.gov> Jonathan, Just trying to understand the workflow better.. from what I can tell it is currently something like this: You run the swift script called calculate_point_values.swift calculate_point_values calls run_swat_wrapper, which runs matlab matlab loads a file called run_swat_wrapper.m Is one of the matlab files launching instances of swift on the worker nodes? David ----- Original Message ----- > From: "Jonathan Margoliash" > To: "David Kelly" > Cc: swift-user at ci.uchicago.edu, "Professor E. Yan" > Sent: Wednesday, September 12, 2012 4:46:20 PM > Subject: Re: Getting swift to run on Fusion > So, I am using your sites.xml file for a slightly newer version of the > code (located at home/jmargoliash/my_SwiftSCE2_branch as opposed to > home/jmargoliash/my_SwiftSCE2_branch_matlab ). When I run this > version, I'm getting the following error out of matlab: > > > > ---- > > stderr.txt: Fatal Error on startup: Unable to start the JVM. > Error occurred during initialization of VM > Could not reserve enough space for object heap > > > There is not enough memory to start up the Java virtual machine. > Try quitting other applications or increasing your virtual memory. > ----- > > > My question is: is this error matlab's fault, or is this sites.xml > file trying to run too many apps on each node at once? > > > Also, a tangentially related questions about the coaster service: > Why would you want to have more than one coaster worker running on > each node? Or rather, does each coaster worker correspond to a single > app invocation, or can one coaster worker manage many simultaneous app > invocations on a single node? > > > > > On Wed, Sep 12, 2012 at 3:44 PM, Jonathan Margoliash < > jmargolpeople at gmail.com > wrote: > > > Thanks David! > > I was expecting the jobs themselves to crash, but I wanted to get to > that point in the debugging process. I'll try this script for now, and > we'll figure out why the other one wasn't working. Thanks again, > > > Jonathan > > > > > On Wed, Sep 12, 2012 at 3:24 PM, David Kelly < davidk at ci.uchicago.edu > > wrote: > > > Jonathan, > > I think the error is related to something being misconfigured in > sites.xml. When I tried the same version of sites.xml, I saw the same > qdel error. I will look to see why that is, but in the meantime, can > you please try using this sites.xml: > > > > > > 3600 > 8 > shared > 100 > 1 > 2 > 5.99 > 10000 > 100 > 100 > /homes/davidk/my_SwiftSCE2_branch_matlab/runs/run-20120912-150613/swiftwork > > > > (Modify your workdirectory as needed). I added this to my copy of > fusion_start_sce.sh, in > /homes/davidk/my_SwiftSCE2_branch_matlab/fusion_start_sce.sh. It seems > to work for me, at least in terms of submitting and reporting on the > status of jobs. The jobs themselves fail because there are references > to /usr/bin/octave in some scripts which doesn't exist on Fusion. > Hopefully this should help you get a little further. > > > Thanks, > David > > ----- Original Message ----- > > From: "Jonathan Margoliash" < jmargolpeople at gmail.com > > > > To: "David Kelly" < davidk at ci.uchicago.edu > > > Cc: swift-user at ci.uchicago.edu , "Professor E. 
Yan" < eyan at anl.gov > > > Sent: Wednesday, September 12, 2012 11:32:56 AM > > > Subject: Re: Getting swift to run on Fusion > > I attached the .0.rlog, .log, .d and the swift.log files. Which of > > those files do you use for debugging? And these files are all > > located > > in the directory > > > > > > /home/jmargoliash/my_SwiftSCE2_branch_matlab/runs/run-20120912-103235 > > > > > > on Fusion, if that's what you were asking for. Thanks! > > > > > > Jonathan > > > > > > On Wed, Sep 12, 2012 at 12:20 PM, David Kelly < > > davidk at ci.uchicago.edu > > > wrote: > > > > > > Jonathan, > > > > Could you please provide a pointer to the log file that got created > > from this run? > > > > Thanks, > > David > > > > > > > > ----- Original Message ----- > > > From: "Jonathan Margoliash" < jmargolpeople at gmail.com > > > > > > To: swift-user at ci.uchicago.edu , "Swift Language" < > > > davidk at ci.uchicago.edu >, "Professor E. Yan" < eyan at anl.gov > > > > Sent: Wednesday, September 12, 2012 10:50:35 AM > > > Subject: Getting swift to run on Fusion > > > Hello swift support, > > > > > > > > > This is my first attempt getting swift to work on Fusion, and I'm > > > getting the following output to the terminal: > > > > > > > > > ------ > > > > > > > > > > > > Warning: Function toint is deprecated, at line 10 > > > Swift trunk swift-r5882 cog-r3434 > > > > > > > > > RunID: 20120912-1032-5y7xb1ug > > > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500 > > > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34 > > > Submitted:8 > > > ... > > > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34 > > > Submitted:8 > > > Failed to shut down block: Block 0912-321051-000005 (8x60.000s) > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > Failed to cancel task. 
qdel returned with an exit code of 153 > > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) > > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > > > at > > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) > > > at > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) > > > at > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) > > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) > > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) > > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) > > > at > > > org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) > > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) > > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) > > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) > > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) > > > at > > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) > > > at > > > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) > > > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34 > > > Submitted:8 > > > ... > > > > > > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34 > > > Submitted:8 > > > Failed to shut down block: Block 0912-321051-000006 (8x60.000s) > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > Failed to cancel task. 
qdel returned with an exit code of 153 > > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) > > > at > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > > > at > > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) > > > at > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) > > > at > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) > > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) > > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) > > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) > > > at > > > org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) > > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) > > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) > > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) > > > at > > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) > > > at > > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) > > > at > > > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) > > > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34 > > > Submitted:8 > > > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34 > > > Submitted:8 > > > ... > > > > > > > > > ------ > > > > > > > > > I understand the long lines of unchanging "Progress: ..." reports > > > - > > > the shared queue is busy, and so I am not expecting my job to be > > > executed right away. However, I don't understand why I'm getting > > > these > > > "failed to cancel task" errors. I gave each individual app well > > > more > > > than enough time for it to run to completion. And while I set the > > > timelimit on the entire process to be much smaller than it needs > > > (60 in > > > sites.xml, > > > when the process could run for days) > > > I presumed the entire process would just get shut down after 60 > > > seconds of runtime. Why is this cropping up? Thanks, > > > > > > > > > Jonathan From jmargolpeople at gmail.com Thu Sep 13 10:13:46 2012 From: jmargolpeople at gmail.com (Jonathan Margoliash) Date: Thu, 13 Sep 2012 10:13:46 -0500 Subject: [Swift-user] Getting swift to run on Fusion In-Reply-To: <227424889.36340.1347548460809.JavaMail.root@zimbra-mb2.anl.gov> References: <227424889.36340.1347548460809.JavaMail.root@zimbra-mb2.anl.gov> Message-ID: David, I've now got swift working on Fusion! My most recent code in /home/jmargoliash/my_SwiftSCE2_branch runs swift, and manages to get inside app invocations before crashing. I'm still working out some kinks (specifically, the Fusion support team just installed octave on Fusion, and so I'm trying to get that to work). 
However, the swift component seems fine for now. I might have questions later about modifying the sites.xml file. As for my workflow: swift is only being called once at any given time, on the node which called ./fusion_start_sce.sh The workflow is: ./fusion_start_sce.sh -> octave/matlab code -> calculate_point_values.swift calculate_point_values.swift creates run_swat_wrapper.sh jobs run_swat_wrapper.sh -> run_swat_wrapper.m (octave/matlab) -> ... -> SWAT model There is a similar work flow when generate_offspring.swift is called from the control node, as opposed to calculate_point_values.swift (but the two are never called concurrently). One last question: I don't want to have my main matlab/octave method, the one that orchestrates the repeated calls to swift, running on the fusion login nodes. I fear that I would end up eating up too much of their resources. So how do I get it running on one of the compute nodes? Do I just qsub the ./fusion_start_sce.sh command? And if so, it would be the case that swift would then be invoked directly from the compute nodes, in an attempt to distribute jobs to other compute nodes. Would this work? Or would swift do something like trying to submit another job to the scheduler, as opposed to working within the current job? Thanks, Jonathan On Thu, Sep 13, 2012 at 10:01 AM, David Kelly wrote: > Jonathan, > > Just trying to understand the workflow better.. from what I can tell it is > currently something like this: > > You run the swift script called calculate_point_values.swift > calculate_point_values calls run_swat_wrapper, which runs matlab > matlab loads a file called run_swat_wrapper.m > > Is one of the matlab files launching instances of swift on the worker > nodes? > > David > > ----- Original Message ----- > > From: "Jonathan Margoliash" > > To: "David Kelly" > > Cc: swift-user at ci.uchicago.edu, "Professor E. Yan" > > Sent: Wednesday, September 12, 2012 4:46:20 PM > > Subject: Re: Getting swift to run on Fusion > > So, I am using your sites.xml file for a slightly newer version of the > > code (located at home/jmargoliash/my_SwiftSCE2_branch as opposed to > > home/jmargoliash/my_SwiftSCE2_branch_matlab ). When I run this > > version, I'm getting the following error out of matlab: > > > > > > > > ---- > > > > stderr.txt: Fatal Error on startup: Unable to start the JVM. > > Error occurred during initialization of VM > > Could not reserve enough space for object heap > > > > > > There is not enough memory to start up the Java virtual machine. > > Try quitting other applications or increasing your virtual memory. > > ----- > > > > > > My question is: is this error matlab's fault, or is this sites.xml > > file trying to run too many apps on each node at once? > > > > > > Also, a tangentially related questions about the coaster service: > > Why would you want to have more than one coaster worker running on > > each node? Or rather, does each coaster worker correspond to a single > > app invocation, or can one coaster worker manage many simultaneous app > > invocations on a single node? > > > > > > > > > > On Wed, Sep 12, 2012 at 3:44 PM, Jonathan Margoliash < > > jmargolpeople at gmail.com > wrote: > > > > > > Thanks David! > > > > I was expecting the jobs themselves to crash, but I wanted to get to > > that point in the debugging process. I'll try this script for now, and > > we'll figure out why the other one wasn't working. 
Thanks again, > > > > > > Jonathan > > > > > > > > > > On Wed, Sep 12, 2012 at 3:24 PM, David Kelly < davidk at ci.uchicago.edu > > > wrote: > > > > > > Jonathan, > > > > I think the error is related to something being misconfigured in > > sites.xml. When I tried the same version of sites.xml, I saw the same > > qdel error. I will look to see why that is, but in the meantime, can > > you please try using this sites.xml: > > > > > > > > > > > > 3600 > > 8 > > shared > > 100 > > 1 > > 2 > > 5.99 > > 10000 > > 100 > > 100 > > > /homes/davidk/my_SwiftSCE2_branch_matlab/runs/run-20120912-150613/swiftwork > > > > > > > > (Modify your workdirectory as needed). I added this to my copy of > > fusion_start_sce.sh, in > > /homes/davidk/my_SwiftSCE2_branch_matlab/fusion_start_sce.sh. It seems > > to work for me, at least in terms of submitting and reporting on the > > status of jobs. The jobs themselves fail because there are references > > to /usr/bin/octave in some scripts which doesn't exist on Fusion. > > Hopefully this should help you get a little further. > > > > > > Thanks, > > David > > > > ----- Original Message ----- > > > From: "Jonathan Margoliash" < jmargolpeople at gmail.com > > > > > > To: "David Kelly" < davidk at ci.uchicago.edu > > > > Cc: swift-user at ci.uchicago.edu , "Professor E. Yan" < eyan at anl.gov > > > > Sent: Wednesday, September 12, 2012 11:32:56 AM > > > > > Subject: Re: Getting swift to run on Fusion > > > I attached the .0.rlog, .log, .d and the swift.log files. Which of > > > those files do you use for debugging? And these files are all > > > located > > > in the directory > > > > > > > > > /home/jmargoliash/my_SwiftSCE2_branch_matlab/runs/run-20120912-103235 > > > > > > > > > on Fusion, if that's what you were asking for. Thanks! > > > > > > > > > Jonathan > > > > > > > > > On Wed, Sep 12, 2012 at 12:20 PM, David Kelly < > > > davidk at ci.uchicago.edu > > > > wrote: > > > > > > > > > Jonathan, > > > > > > Could you please provide a pointer to the log file that got created > > > from this run? > > > > > > Thanks, > > > David > > > > > > > > > > > > ----- Original Message ----- > > > > From: "Jonathan Margoliash" < jmargolpeople at gmail.com > > > > > > > > > To: swift-user at ci.uchicago.edu , "Swift Language" < > > > > davidk at ci.uchicago.edu >, "Professor E. Yan" < eyan at anl.gov > > > > > Sent: Wednesday, September 12, 2012 10:50:35 AM > > > > Subject: Getting swift to run on Fusion > > > > Hello swift support, > > > > > > > > > > > > This is my first attempt getting swift to work on Fusion, and I'm > > > > getting the following output to the terminal: > > > > > > > > > > > > ------ > > > > > > > > > > > > > > > > Warning: Function toint is deprecated, at line 10 > > > > Swift trunk swift-r5882 cog-r3434 > > > > > > > > > > > > RunID: 20120912-1032-5y7xb1ug > > > > Progress: time: Wed, 12 Sep 2012 10:32:51 -0500 > > > > Progress: time: Wed, 12 Sep 2012 10:32:54 -0500 Selecting site:34 > > > > Submitted:8 > > > > Progress: time: Wed, 12 Sep 2012 10:32:57 -0500 Selecting site:34 > > > > Submitted:8 > > > > Progress: time: Wed, 12 Sep 2012 10:33:00 -0500 Selecting site:34 > > > > Submitted:8 > > > > ... > > > > Progress: time: Wed, 12 Sep 2012 10:40:33 -0500 Selecting site:34 > > > > Submitted:8 > > > > Failed to shut down block: Block 0912-321051-000005 (8x60.000s) > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > > Failed to cancel task. 
qdel returned with an exit code of 153 > > > > at > > > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) > > > > at > > > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > > > > at > > > > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) > > > > at > > > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) > > > > at > > > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) > > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) > > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) > > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) > > > > at > > > > > org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) > > > > at > > > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) > > > > at > > > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) > > > > at > > > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) > > > > at > > > > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) > > > > at > > > > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) > > > > at > > > > > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) > > > > Progress: time: Wed, 12 Sep 2012 10:40:36 -0500 Selecting site:34 > > > > Submitted:8 > > > > Progress: time: Wed, 12 Sep 2012 10:40:39 -0500 Selecting site:34 > > > > Submitted:8 > > > > Progress: time: Wed, 12 Sep 2012 10:40:42 -0500 Selecting site:34 > > > > Submitted:8 > > > > ... > > > > > > > > Progress: time: Wed, 12 Sep 2012 10:41:42 -0500 Selecting site:34 > > > > Submitted:8 > > > > Progress: time: Wed, 12 Sep 2012 10:41:45 -0500 Selecting site:34 > > > > Submitted:8 > > > > Failed to shut down block: Block 0912-321051-000006 (8x60.000s) > > > > org.globus.cog.abstraction.impl.common.task.TaskSubmissionException: > > > > Failed to cancel task. 
qdel returned with an exit code of 153 > > > > at > > > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractExecutor.cancel(AbstractExecutor.java:205) > > > > at > > > > > org.globus.cog.abstraction.impl.scheduler.common.AbstractJobSubmissionTaskHandler.cancel(AbstractJobSubmissionTaskHandler.java:85) > > > > at > > > > > org.globus.cog.abstraction.impl.common.AbstractTaskHandler.cancel(AbstractTaskHandler.java:69) > > > > at > > > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:102) > > > > at > > > > > org.globus.cog.abstraction.impl.common.task.ExecutionTaskHandler.cancel(ExecutionTaskHandler.java:91) > > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockTaskSubmitter.cancel(BlockTaskSubmitter.java:46) > > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.Block.forceShutdown(Block.java:320) > > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.Node.errorReceived(Node.java:100) > > > > at > > > > > org.globus.cog.karajan.workflow.service.commands.Command.errorReceived(Command.java:203) > > > > at > > > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyListeners(ChannelContext.java:237) > > > > at > > > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.notifyRegisteredCommandsAndHandlers(ChannelContext.java:225) > > > > at > > > > > org.globus.cog.karajan.workflow.service.channels.ChannelContext.channelShutDown(ChannelContext.java:318) > > > > at > > > > > org.globus.cog.karajan.workflow.service.channels.ChannelManager.handleChannelException(ChannelManager.java:293) > > > > at > > > > > org.globus.cog.karajan.workflow.service.channels.AbstractKarajanChannel.handleChannelException(AbstractKarajanChannel.java:552) > > > > at > > > > > org.globus.cog.karajan.workflow.service.channels.NIOSender.run(NIOSender.java:140) > > > > Progress: time: Wed, 12 Sep 2012 10:41:48 -0500 Selecting site:34 > > > > Submitted:8 > > > > Progress: time: Wed, 12 Sep 2012 10:41:51 -0500 Selecting site:34 > > > > Submitted:8 > > > > Progress: time: Wed, 12 Sep 2012 10:41:54 -0500 Selecting site:34 > > > > Submitted:8 > > > > ... > > > > > > > > > > > > ------ > > > > > > > > > > > > I understand the long lines of unchanging "Progress: ..." reports > > > > - > > > > the shared queue is busy, and so I am not expecting my job to be > > > > executed right away. However, I don't understand why I'm getting > > > > these > > > > "failed to cancel task" errors. I gave each individual app well > > > > more > > > > than enough time for it to run to completion. And while I set the > > > > timelimit on the entire process to be much smaller than it needs > > > > (60 in > > > > sites.xml, > > > > when the process could run for days) > > > > I presumed the entire process would just get shut down after 60 > > > > seconds of runtime. Why is this cropping up? Thanks, > > > > > > > > > > > > Jonathan > -------------- next part -------------- An HTML attachment was scrubbed... URL: From davidk at ci.uchicago.edu Thu Sep 13 10:39:34 2012 From: davidk at ci.uchicago.edu (David Kelly) Date: Thu, 13 Sep 2012 10:39:34 -0500 (CDT) Subject: [Swift-user] Getting swift to run on Fusion In-Reply-To: Message-ID: <1733158597.36775.1347550774816.JavaMail.root@zimbra-mb2.anl.gov> One thing you can try is running qsub -I. That gives you an interactive session on one of the worker nodes on Fusion. 
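If you'd rather not keep an interactive session open, you can also wrap the driver in a small batch script and qsub that instead. A rough sketch — the job name, the run_driver.pbs file name, and the resource limits are just placeholders to adjust for what you need on Fusion:

#!/bin/bash
#PBS -N sce_driver
#PBS -l nodes=1:ppn=1
#PBS -l walltime=24:00:00
# placeholder job name and limits above; adjust as needed
# run the driver from the directory the job was submitted from
cd $PBS_O_WORKDIR
./fusion_start_sce.sh

Submit it with qsub run_driver.pbs; Swift then runs on that compute node and submits its coaster blocks to PBS from there, assuming qsub is usable from the compute nodes. For the interactive route: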
Here is an example command that should work for you: qsub -I -l nodes=1:ppn=1 -l walltime=24:00:00 That will give you a shell on a Fusion worker node, using one processor on one node for a maximum of 24 hours. You can use that session to run your shell/matlab script. As it is configured now, Swift will not use that same interactive session for processing work - it will request new nodes and submit the work there. David From jmargolpeople at gmail.com Thu Sep 13 12:50:04 2012 From: jmargolpeople at gmail.com (Jonathan Margoliash) Date: Thu, 13 Sep 2012 12:50:04 -0500 Subject: [Swift-user] Coasters submitting too many jobs simultaneously Message-ID: Hey David, another question: When I run Swift on Fusion using the sites.xml file you sent me, Swift is scheduling many jobs on Fusion. Why is that? The sites.xml specifies and I thought the point of using coasters as the execution provider was to wrap all of my separate app calls into a single job submission. With swift scheduling so many jobs, it's hard to track down and manually abort them when I need to. Maybe this stems from my lack of understanding of the coaster system. I thought jobsPerNode limited the number of app calls the would be sent to any node at a given time. However, in looking back at the web page, I'm now thinking that maybe it limits the number of swift coaster workers on each node, while each swift coaster worker can run many apps at once. If that is true, then how do I limit the number of apps run on each node simultaneously? And if each swift worker can run many apps at once, why would I ever want jobsPerNode > 1? Also, does the slots variable have anything to do with this? If so, what does it do? For reference, the workdirectory for the swift call is /home/jmargoliash/my_SwiftSCE2_branch/runs/run-20120913-121403 Here's the output of a bunch of tests I ran while swift was going: -------------------------------- Sitest.xml: 3600 8 shared 100 1 2 5.99 10000 100 100 /home/jmargoliash/my_SwiftSCE2_branch/runs/run-20120913-121403/swiftwork --------------------------------- Terminal output from running swift: Entering swift from create_random_sample ---- text generated by my code Warning: Function toint is deprecated, at line 10 Progress: time: Thu, 13 Sep 2012 12:14:22 -0500 Progress: time: Thu, 13 Sep 2012 12:14:23 -0500 Initializing:1 Progress: time: Thu, 13 Sep 2012 12:14:24 -0500 Stage in:99 Submitting:1 Progress: time: Thu, 13 Sep 2012 12:14:25 -0500 Stage in:86 Submitting:1 Submitted:13 Progress: time: Thu, 13 Sep 2012 12:14:27 -0500 Submitted:99 Active:1 Progress: time: Thu, 13 Sep 2012 12:14:30 -0500 Submitted:91 Active:9 Progress: time: Thu, 13 Sep 2012 12:14:31 -0500 Submitted:59 Active:41 Progress: time: Thu, 13 Sep 2012 12:14:32 -0500 Submitted:27 Active:73 Progress: time: Thu, 13 Sep 2012 12:14:34 -0500 Submitted:12 Active:88 Progress: time: Thu, 13 Sep 2012 12:14:37 -0500 Submitted:12 Active:88 Progress: time: Thu, 13 Sep 2012 12:14:39 -0500 Submitted:11 Active:89 Progress: time: Thu, 13 Sep 2012 12:14:40 -0500 Submitted:4 Active:96 Progress: time: Thu, 13 Sep 2012 12:14:43 -0500 Submitted:4 Active:96 Progress: time: Thu, 13 Sep 2012 12:14:46 -0500 Submitted:4 Active:96 Progress: time: Thu, 13 Sep 2012 12:14:49 -0500 Submitted:4 Active:96 Progress: time: Thu, 13 Sep 2012 12:14:52 -0500 Submitted:4 Active:96 ... (Why are so many apps considered submitted/active at once? 
I only want 8 apps working per node at maximum (because each node only has 8 cores), and since maxNodes = 2 at the moment, I want active <= 16 at all times). ------- Output of show-q u $USER after swift has been killed manually: (Notice that a bunch of jobs are still going. Why doesn't swift shut them down automatically when it quits?) [jmargoliash at flogin3 my_SwiftSCE2_branch]$ showq -u $USER ACTIVE JOBS-------------------- JOBNAME USERNAME STATE PROC REMAINING STARTTIME 1289476 jmargoliash Running 1 00:58:44 Thu Sep 13 12:14:27 1289477 jmargoliash Running 1 00:58:46 Thu Sep 13 12:14:29 1289478 jmargoliash Running 1 00:58:46 Thu Sep 13 12:14:29 1289479 jmargoliash Running 1 00:58:47 Thu Sep 13 12:14:30 1289480 jmargoliash Running 1 00:58:47 Thu Sep 13 12:14:30 1289481 jmargoliash Running 2 00:58:47 Thu Sep 13 12:14:30 1289482 jmargoliash Running 2 00:58:47 Thu Sep 13 12:14:30 1289483 jmargoliash Running 2 00:58:48 Thu Sep 13 12:14:31 1289484 jmargoliash Running 1 00:58:48 Thu Sep 13 12:14:31 9 Active Jobs 2860 of 3088 Processors Active (92.62%) 343 of 346 Nodes Active (99.13%) IDLE JOBS---------------------- JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME 0 Idle Jobs BLOCKED JOBS---------------- JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME Total Jobs: 9 Active Jobs: 9 Idle Jobs: 0 Blocked Jobs: 0 [jmargoliash at flogin3 my_SwiftSCE2_branch]$ ------------------------- Output of ps -u $USER -H after swift has been killed: [jmargoliash at flogin3 my_SwiftSCE2_branch]$ ps -u $USER -H PID TTY TIME CMD 19603 ? 00:00:00 sshd 19604 pts/16 00:00:00 bash 17270 ? 00:00:00 sshd 17271 pts/25 00:00:00 bash 6495 pts/25 00:00:00 vim 16825 ? 00:00:00 sshd 16826 pts/34 00:00:00 bash 25813 pts/34 00:00:00 ps 4494 ? 00:00:00 sshd 4495 pts/1 00:00:00 bash 31023 pts/1 00:00:00 vim 24727 pts/16 00:00:00 qdel <----------- 20792 pts/16 00:00:00 check_on_swift. 20793 pts/16 00:00:00 sleep 19755 pts/16 00:00:00 tee You can see that a qdel command has been started after swift finished. (I'm pretty sure this is not a call that was left over hanging from when I called qdel earlier). I assume this is swift's attempt to shut down the processes it has started up as it exits. However, I presumed qdel would have a near-instantaneous return. Why is it hanging here? Is this a problem with fusion, or with my code? -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at mcs.anl.gov Thu Sep 13 15:11:30 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Thu, 13 Sep 2012 15:11:30 -0500 (CDT) Subject: [Swift-user] Coasters submitting too many jobs simultaneously In-Reply-To: Message-ID: <2031085776.15338.1347567090095.JavaMail.root@zimbra.anl.gov> Hi Jonathan, To clarify what your sites file below is requesting from coasters: (note that I've changed the order of the XML tags to better explain them) shared Send all jobs to the shared queue. On Fusion, I think PBS runs multiple jobs per node on this queue. So while the shared queue is good for fester testing, it may conflict with what Swift is requesting per the tags below. 100 Submit up to 100 jobs to PBS at a time. 1 2 PBS jobs will request up to 2 nodes. This may interact in odd ways with the shared queue - not sure, need to investigate that. Some jobs may be 1 node, others 2, depending on how many app requests your Swift script is making at the point when coasters decides to batch up the current requests into PBS jobs. 8 Run 8 swift apps concurrently on every PBS node. 
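For reference, spelled out with their profile tags those settings correspond to roughly the following. I'm reconstructing the key names from memory, so treat this as a sketch and check it against the actual sites.xml rather than copying it verbatim:

<pool handle="fusion">
  <execution provider="coaster" jobmanager="local:pbs"/>
  <profile namespace="globus" key="queue">shared</profile>
  <profile namespace="globus" key="slots">100</profile>
  <profile namespace="globus" key="nodeGranularity">1</profile>
  <profile namespace="globus" key="maxNodes">2</profile>
  <profile namespace="globus" key="jobsPerNode">8</profile>
  <profile namespace="karajan" key="jobThrottle">5.99</profile>
  <profile namespace="karajan" key="initialScore">10000</profile>
  <profile namespace="globus" key="lowOverallocation">100</profile>
  <profile namespace="globus" key="highOverallocation">100</profile>
  <profile namespace="globus" key="maxtime">3600</profile>
  <filesystem provider="local"/>
  <workdirectory>/home/jmargoliash/my_SwiftSCE2_branch/runs/run-20120913-121403/swiftwork</workdirectory>
</pool>

Of those, jobsPerNode is the one that runs 8 apps concurrently on every PBS node.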
This will definitely be more than you want in the shared queue, where PBS is (I *think*) giving you just 1 *core*. 5.99 10000 Run up to 600 app calls at once on this site. 100 100 3600 Make all job slots request the max amount of time (maxtime) - Mike ----- Original Message ----- > From: "Jonathan Margoliash" > To: "Swift Language" , swift-user at ci.uchicago.edu, "Professor E. Yan" , "Michael > Wilde" > Sent: Thursday, September 13, 2012 12:50:04 PM > Subject: Coasters submitting too many jobs simultaneously > Hey David, another question: > > > When I run Swift on Fusion using the sites.xml file you sent me, Swift > is scheduling many jobs on Fusion. Why is that? The sites.xml > specifies > > and I thought the point of using coasters as the execution provider > was to wrap all of my separate app calls into a single job submission. > With swift scheduling so many jobs, it's hard to track down and > manually abort them when I need to. > > > Maybe this stems from my lack of understanding of the coaster system. > I thought jobsPerNode limited the number of app calls the would be > sent to any node at a given time. However, in looking back at the web > page, I'm now thinking that maybe it limits the number of swift > coaster workers on each node, while each swift coaster worker can run > many apps at once. If that is true, then how do I limit the number of > apps run on each node simultaneously? And if each swift worker can run > many apps at once, why would I ever want jobsPerNode > 1? Also, does > the slots variable have anything to do with this? If so, what does it > do? > > > For reference, the workdirectory for the swift call is > /home/jmargoliash/my_SwiftSCE2_branch/runs/run-20120913-121403 > Here's the output of a bunch of tests I ran while swift was going: > > > > -------------------------------- > Sitest.xml: > > > > > > > > 3600 > 8 > shared > 100 > 1 > 2 > 5.99 > 10000 > 100 > 100 > /home/jmargoliash/my_SwiftSCE2_branch/runs/run-20120913-121403/swiftwork > > > > --------------------------------- > Terminal output from running swift: > > > > Entering swift from create_random_sample ---- text generated by my > code > Warning: Function toint is deprecated, at line 10 > Progress: time: Thu, 13 Sep 2012 12:14:22 -0500 > Progress: time: Thu, 13 Sep 2012 12:14:23 -0500 Initializing:1 > Progress: time: Thu, 13 Sep 2012 12:14:24 -0500 Stage in:99 > Submitting:1 > Progress: time: Thu, 13 Sep 2012 12:14:25 -0500 Stage in:86 > Submitting:1 Submitted:13 > Progress: time: Thu, 13 Sep 2012 12:14:27 -0500 Submitted:99 Active:1 > Progress: time: Thu, 13 Sep 2012 12:14:30 -0500 Submitted:91 Active:9 > Progress: time: Thu, 13 Sep 2012 12:14:31 -0500 Submitted:59 Active:41 > Progress: time: Thu, 13 Sep 2012 12:14:32 -0500 Submitted:27 Active:73 > Progress: time: Thu, 13 Sep 2012 12:14:34 -0500 Submitted:12 Active:88 > Progress: time: Thu, 13 Sep 2012 12:14:37 -0500 Submitted:12 Active:88 > Progress: time: Thu, 13 Sep 2012 12:14:39 -0500 Submitted:11 Active:89 > Progress: time: Thu, 13 Sep 2012 12:14:40 -0500 Submitted:4 Active:96 > Progress: time: Thu, 13 Sep 2012 12:14:43 -0500 Submitted:4 Active:96 > Progress: time: Thu, 13 Sep 2012 12:14:46 -0500 Submitted:4 Active:96 > Progress: time: Thu, 13 Sep 2012 12:14:49 -0500 Submitted:4 Active:96 > Progress: time: Thu, 13 Sep 2012 12:14:52 -0500 Submitted:4 Active:96 > ... > > > (Why are so many apps considered submitted/active at once? 
I only want > 8 apps working per node at maximum (because each node only has 8 > cores), and since maxNodes = 2 at the moment, I want active <= 16 at > all times). > > > ------- > Output of show-q u $USER after swift has been killed manually: (Notice > that a bunch of jobs are still going. Why doesn't swift shut them down > automatically when it quits?) > > > [jmargoliash at flogin3 my_SwiftSCE2_branch]$ showq -u $USER > ACTIVE JOBS-------------------- > JOBNAME USERNAME STATE PROC REMAINING STARTTIME > > > 1289476 jmargoliash Running 1 00:58:44 Thu Sep 13 12:14:27 > 1289477 jmargoliash Running 1 00:58:46 Thu Sep 13 12:14:29 > 1289478 jmargoliash Running 1 00:58:46 Thu Sep 13 12:14:29 > 1289479 jmargoliash Running 1 00:58:47 Thu Sep 13 12:14:30 > 1289480 jmargoliash Running 1 00:58:47 Thu Sep 13 12:14:30 > 1289481 jmargoliash Running 2 00:58:47 Thu Sep 13 12:14:30 > 1289482 jmargoliash Running 2 00:58:47 Thu Sep 13 12:14:30 > 1289483 jmargoliash Running 2 00:58:48 Thu Sep 13 12:14:31 > 1289484 jmargoliash Running 1 00:58:48 Thu Sep 13 12:14:31 > > > 9 Active Jobs 2860 of 3088 Processors Active (92.62%) > 343 of 346 Nodes Active (99.13%) > > > IDLE JOBS---------------------- > JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME > > > > > 0 Idle Jobs > > > BLOCKED JOBS---------------- > JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME > > > > > Total Jobs: 9 Active Jobs: 9 Idle Jobs: 0 Blocked Jobs: 0 > [jmargoliash at flogin3 my_SwiftSCE2_branch]$ > > > > > ------------------------- > Output of ps -u $USER -H after swift has been killed: > > > > [jmargoliash at flogin3 my_SwiftSCE2_branch]$ ps -u $USER -H > PID TTY TIME CMD > 19603 ? 00:00:00 sshd > 19604 pts/16 00:00:00 bash > 17270 ? 00:00:00 sshd > 17271 pts/25 00:00:00 bash > 6495 pts/25 00:00:00 vim > 16825 ? 00:00:00 sshd > 16826 pts/34 00:00:00 bash > 25813 pts/34 00:00:00 ps > 4494 ? 00:00:00 sshd > 4495 pts/1 00:00:00 bash > 31023 pts/1 00:00:00 vim > 24727 pts/16 00:00:00 qdel <----------- > 20792 pts/16 00:00:00 check_on_swift. > 20793 pts/16 00:00:00 sleep > 19755 pts/16 00:00:00 tee > > > You can see that a qdel command has been started after swift finished. > (I'm pretty sure this is not a call that was left over hanging from > when I called qdel earlier). I assume this is swift's attempt to shut > down the processes it has started up as it exits. However, I presumed > qdel would have a near-instantaneous return. Why is it hanging here? > Is this a problem with fusion, or with my code? -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From jmargolpeople at gmail.com Fri Sep 14 08:47:56 2012 From: jmargolpeople at gmail.com (Jonathan Margoliash) Date: Fri, 14 Sep 2012 08:47:56 -0500 Subject: [Swift-user] Swift progress reports are too frequent Message-ID: Hello swift support, I just downloaded the most recent version of the swift trunk, and now Swift prints to a progress report every 3 seconds. (Something of the form "Progress: .... ") Beforehand, swift printed something every 30 seconds, and that made scrolling through swift's output much more manageable. How can I change this back, so that swift delivers reports much less often? Jonathan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wilde at mcs.anl.gov Fri Sep 14 09:48:56 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 14 Sep 2012 09:48:56 -0500 (CDT) Subject: [Swift-user] Swift error with no explanation In-Reply-To: Message-ID: <106255828.16462.1347634136884.JavaMail.root@zimbra.anl.gov> Jonathan, it seems that there's some kind of error cascade here. You have an application script that repeatedly runs Swift. Several Swift runs complete successfully. Then one run encounters a Java null pointer exception in Swift, seemingly in a method related to logging progress of the run. Close to the same time, one of your app() calls is failing. After that, all the Swift runs fail due to an app() call thats failing in a different manner. Im moving this discussion to swift-support to avoid clogging up swift-user with lots of debugging discussion. Please send us (on swift-support) a pointer to where we can locate your run directories. I'd like to find the log file for RunID: 20120914-0423-eag9170a, and send that to Mihael to debug the null pointer exception. I'd like to look closer at the subsequent runs to see whats failing. Im thinking something went wrong in your application environment thats causing the initial app failure, which in turn may have tripped into some intermittent Swift bug. - Mike ----- Original Message ----- > From: "Jonathan Margoliash" > To: swift-user at ci.uchicago.edu, "Swift Language" , "Michael Wilde" , > "Professor E. Yan" > Sent: Friday, September 14, 2012 9:25:41 AM > Subject: Swift error with no explanation > Hello swift support, > > > Last night I was running a swift script, and it crashed for no reason > I can discern. The attached file terminal_output.txt is the output > (both stdout and stderr) that was printed during the run. To the best > of my knowledge, the first error crops up on line 18192. The text is > "Execution failed: java stack trace with NullPointerException at the > top". What's more confusing is that swift printed the following to the > terminal: > > > RunID: 20120914-0423-eag9170a > Failed to transfer wrapper log for job Nelder_Mead-mnsj53yk > EXCEPTION Exception in Nelder_Mead: > Arguments: [SCE_Par.mat, simplex7_3.mat, new_point7_3_1.mat, 1, 151, > run_swat.m] > Host: thwomp > Directory: > generate_offspring-20120914-0423-eag9170a/jobs/m/Nelder_Mead-mnsj53yk > stderr.txt: > > > stdout.txt: > > > ---- > > > The app that I was running (Nelder_Mead.sh) prints a line to the > stdout as the first thing it does, and since swift registered nothing > being written to stdout.txt, my only conclusion is that my app was > never run, and that this wasn't a problem with my code. I've attached > two log files to this email, in hopes that one of them corresponds to > this swift run, but neither of them have the right starting time (too > soon or too late), and my guess is that the line "Failed to transfer > wrapper log for job Nelder_Mead-mnsj53yk" indicates that no logfile > was generated for this call to swift. Also, please ignore the output > to terminal_output.txt generated after the swift-crash. My program > kept running and making more calls to swift, but I cannot predict what > it was doing given that I don't know what happened with swift, and I > would have rather it had just crashed when swift did. 
> > > Anyway, thanks for the help, > > > Jonathan -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Fri Sep 14 09:54:34 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 14 Sep 2012 09:54:34 -0500 (CDT) Subject: [Swift-user] Fwd: Swift progress reports are too frequent In-Reply-To: Message-ID: <729621291.16483.1347634474241.JavaMail.root@zimbra.anl.gov> Im moving this thread to swift-devel. Mihael or David, can you tell us (or determine) why the progress message frequency changed from 30 secs to 3 secs? Should revert back to 30? Thanks, - Mike ----- Forwarded Message ----- From: "Jonathan Margoliash" To: swift-user at ci.uchicago.edu Sent: Friday, September 14, 2012 8:47:56 AM Subject: [Swift-user] Swift progress reports are too frequent Hello swift support, I just downloaded the most recent version of the swift trunk, and now Swift prints to a progress report every 3 seconds. (Something of the form "Progress: .... ") Beforehand, swift printed something every 30 seconds, and that made scrolling through swift's output much more manageable. How can I change this back, so that swift delivers reports much less often? Jonathan _______________________________________________ Swift-user mailing list Swift-user at ci.uchicago.edu https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Fri Sep 14 13:14:12 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 14 Sep 2012 13:14:12 -0500 (CDT) Subject: [Swift-user] Custom/External Mappers In-Reply-To: <97B8C98AF6CD2146ADC0FEB4066CD50602A96535@xm-mbx-04-prod.ad.uchicago.edu> Message-ID: <1421925035.17096.1347646452963.JavaMail.root@zimbra.anl.gov> Robin, I looked into this problem a bit. It seems that your external mapper scripts are not returning anything on standard output (and an error message on stderr) for the arguments that you are passing in the failing cases. Here's what my tests show: $ ./clusMapper_name.sh [0] out/.0.out [1] out/.1.out [2] out/.2.out [3] out/.3.out $ $ ./clusMapper_name.sh -name=name_here ./clusMapper_name.sh: bad mapper args $ $ ./clusMapper_loop.sh -nfiles=3 ./clusMapper_loop.sh: bad mapper args $ I dont see exactly why Swift is hanging in the cases where the mappers return nothing. I would have thought a nicer behavior would be to (a) exit complaining that the external mapper didnt return with exit code 0, or (b) proceed as if the arrays were not mapped, and use the default (concurrent) mapper. I suspect whats happening is that Swift is not closing the array when the mapper fails, or some similar synchronization error. We'll need to look into this. I've filed this as bugzilla bug # 829. - Mike ----- Original Message ----- > From: "Robin Weiss" > To: swift-user at ci.uchicago.edu > Sent: Monday, September 10, 2012 4:56:45 PM > Subject: [Swift-user] Custom/External Mappers > Howdy swifters, > > > I am having some issues getting external mappers to work. In > particular, swift appears to hang when you use a custom mapper that > takes in one or more command line arguments. > > > Attached is a tar ball with a basic example of the issue. You can > uncomment each line in the mapper.swift script to see the behavior. I > have included 4 versions of my external mapper, two work and two do > not (see mapper.swift). 
Also included are the config, sites (using > localhost), and tc files I'm using. > > > I noticed this issue after moving from version 0.93 to trunk (Swift > trunk swift-r5917 cog-r3463) when some scripts I had been using > started to hang. > > > Thanks in advance, > Robin > > > > > -- > > Robin M. Weiss > Research Programmer > Research Computing Center > The University of Chicago > 6030 S. Ellis Ave., Suite 289C > Chicago, IL 60637 > robinweiss at uchicago.edu > 773.702.9030 > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From robinweiss at uchicago.edu Fri Sep 14 13:26:39 2012 From: robinweiss at uchicago.edu (Robin Weiss) Date: Fri, 14 Sep 2012 18:26:39 +0000 Subject: [Swift-user] Custom/External Mappers In-Reply-To: <1421925035.17096.1347646452963.JavaMail.root@zimbra.anl.gov> Message-ID: <97B8C98AF6CD2146ADC0FEB4066CD50602A96AED@xm-mbx-04-prod.ad.uchicago.edu> Interesting. I wrote the custom mapper script expecting swift to call it with a whitespace (not equal sign) between the parameter switch and the value that goes with it (I.e.: ./clusMapper_name.sh -name name_goes_here). This is how the example in the swift users guide for 0.93 (http://www.ci.uchicago.edu/swift/guides/release-0.93/userguide/userguide.h tml#_external_mapper) does it. The example from the users guide will also chokes if you put an equal sign instead of whitespace between the param name switch and the value. This issue popped up after we updated to the latest trunk version from 0.93. Did the way swift calls external mappers change (I.e. Equal sign separated instead of whitespace between param name and value)? Robin -- Robin M. Weiss Research Programmer Research Computing Center The University of Chicago 6030 S. Ellis Ave., Suite 289C Chicago, IL 60637 robinweiss at uchicago.edu 773.702.9030 On 9/14/12 1:14 PM, "Michael Wilde" wrote: >Robin, I looked into this problem a bit. It seems that your external >mapper scripts are not returning anything on standard output (and an >error message on stderr) for the arguments that you are passing in the >failing cases. > >Here's what my tests show: > >$ ./clusMapper_name.sh >[0] out/.0.out >[1] out/.1.out >[2] out/.2.out >[3] out/.3.out >$ > >$ ./clusMapper_name.sh -name=name_here >./clusMapper_name.sh: bad mapper args >$ > >$ ./clusMapper_loop.sh -nfiles=3 >./clusMapper_loop.sh: bad mapper args >$ > >I dont see exactly why Swift is hanging in the cases where the mappers >return nothing. I would have thought a nicer behavior would be to (a) >exit complaining that the external mapper didnt return with exit code 0, >or (b) proceed as if the arrays were not mapped, and use the default >(concurrent) mapper. > >I suspect whats happening is that Swift is not closing the array when the >mapper fails, or some similar synchronization error. We'll need to look >into this. I've filed this as bugzilla bug # 829. > >- Mike > >----- Original Message ----- >> From: "Robin Weiss" >> To: swift-user at ci.uchicago.edu >> Sent: Monday, September 10, 2012 4:56:45 PM >> Subject: [Swift-user] Custom/External Mappers >> Howdy swifters, >> >> >> I am having some issues getting external mappers to work. In >> particular, swift appears to hang when you use a custom mapper that >> takes in one or more command line arguments. 
>> >> >> Attached is a tar ball with a basic example of the issue. You can >> uncomment each line in the mapper.swift script to see the behavior. I >> have included 4 versions of my external mapper, two work and two do >> not (see mapper.swift). Also included are the config, sites (using >> localhost), and tc files I'm using. >> >> >> I noticed this issue after moving from version 0.93 to trunk (Swift >> trunk swift-r5917 cog-r3463) when some scripts I had been using >> started to hang. >> >> >> Thanks in advance, >> Robin >> >> >> >> >> -- >> >> Robin M. Weiss >> Research Programmer >> Research Computing Center >> The University of Chicago >> 6030 S. Ellis Ave., Suite 289C >> Chicago, IL 60637 >> robinweiss at uchicago.edu >> 773.702.9030 >> _______________________________________________ >> Swift-user mailing list >> Swift-user at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > >-- >Michael Wilde >Computation Institute, University of Chicago >Mathematics and Computer Science Division >Argonne National Laboratory > From wilde at mcs.anl.gov Fri Sep 14 13:32:44 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 14 Sep 2012 13:32:44 -0500 (CDT) Subject: [Swift-user] Custom/External Mappers In-Reply-To: <97B8C98AF6CD2146ADC0FEB4066CD50602A96AED@xm-mbx-04-prod.ad.uchicago.edu> Message-ID: <1706510066.17122.1347647564743.JavaMail.root@zimbra.anl.gov> Ooops, my mistake, sorry. You are correct, the ext mapper args are passed as "-argname value". The user guide is correct and that did not likely change in trunk. So the problem is elsewhere. I will investigate further. - Mike ----- Original Message ----- > From: "Robin Weiss" > To: "Michael Wilde" , swift-user at ci.uchicago.edu > Sent: Friday, September 14, 2012 1:26:39 PM > Subject: Re: [Swift-user] Custom/External Mappers > Interesting. I wrote the custom mapper script expecting swift to call > it > with a whitespace (not equal sign) between the parameter switch and > the > value that goes with it (I.e.: ./clusMapper_name.sh -name > name_goes_here). > This is how the example in the swift users guide for 0.93 > (http://www.ci.uchicago.edu/swift/guides/release-0.93/userguide/userguide.h > tml#_external_mapper) does it. The example from the users guide will > also > chokes if you put an equal sign instead of whitespace between the > param > name switch and the value. > > This issue popped up after we updated to the latest trunk version from > 0.93. Did the way swift calls external mappers change (I.e. Equal sign > separated instead of whitespace between param name and value)? > > Robin > > > -- > Robin M. Weiss > Research Programmer > Research Computing Center > The University of Chicago > 6030 S. Ellis Ave., Suite 289C > Chicago, IL 60637 > robinweiss at uchicago.edu > 773.702.9030 > > > > > > > On 9/14/12 1:14 PM, "Michael Wilde" wrote: > > >Robin, I looked into this problem a bit. It seems that your external > >mapper scripts are not returning anything on standard output (and an > >error message on stderr) for the arguments that you are passing in > >the > >failing cases. > > > >Here's what my tests show: > > > >$ ./clusMapper_name.sh > >[0] out/.0.out > >[1] out/.1.out > >[2] out/.2.out > >[3] out/.3.out > >$ > > > >$ ./clusMapper_name.sh -name=name_here > >./clusMapper_name.sh: bad mapper args > >$ > > > >$ ./clusMapper_loop.sh -nfiles=3 > >./clusMapper_loop.sh: bad mapper args > >$ > > > >I dont see exactly why Swift is hanging in the cases where the > >mappers > >return nothing. 
I would have thought a nicer behavior would be to (a) > >exit complaining that the external mapper didnt return with exit code > >0, > >or (b) proceed as if the arrays were not mapped, and use the default > >(concurrent) mapper. > > > >I suspect whats happening is that Swift is not closing the array when > >the > >mapper fails, or some similar synchronization error. We'll need to > >look > >into this. I've filed this as bugzilla bug # 829. > > > >- Mike > > > >----- Original Message ----- > >> From: "Robin Weiss" > >> To: swift-user at ci.uchicago.edu > >> Sent: Monday, September 10, 2012 4:56:45 PM > >> Subject: [Swift-user] Custom/External Mappers > >> Howdy swifters, > >> > >> > >> I am having some issues getting external mappers to work. In > >> particular, swift appears to hang when you use a custom mapper that > >> takes in one or more command line arguments. > >> > >> > >> Attached is a tar ball with a basic example of the issue. You can > >> uncomment each line in the mapper.swift script to see the behavior. > >> I > >> have included 4 versions of my external mapper, two work and two do > >> not (see mapper.swift). Also included are the config, sites (using > >> localhost), and tc files I'm using. > >> > >> > >> I noticed this issue after moving from version 0.93 to trunk (Swift > >> trunk swift-r5917 cog-r3463) when some scripts I had been using > >> started to hang. > >> > >> > >> Thanks in advance, > >> Robin > >> > >> > >> > >> > >> -- > >> > >> Robin M. Weiss > >> Research Programmer > >> Research Computing Center > >> The University of Chicago > >> 6030 S. Ellis Ave., Suite 289C > >> Chicago, IL 60637 > >> robinweiss at uchicago.edu > >> 773.702.9030 > >> _______________________________________________ > >> Swift-user mailing list > >> Swift-user at ci.uchicago.edu > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > >-- > >Michael Wilde > >Computation Institute, University of Chicago > >Mathematics and Computer Science Division > >Argonne National Laboratory > > -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Fri Sep 14 13:43:42 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 14 Sep 2012 13:43:42 -0500 (CDT) Subject: [Swift-user] Custom/External Mappers In-Reply-To: <1706510066.17122.1347647564743.JavaMail.root@zimbra.anl.gov> Message-ID: <1258978443.17147.1347648222494.JavaMail.root@zimbra.anl.gov> OK, I'm revising my diagnosis here. It looks like a change was introduced into trunk which is adding an extra argument to the ext mapper invocation giving the line number in the .swift script from which it was called. This looks to me like some developer debugging that got committed by mistake. We will post to this list when its removed. You can see the problem by inserting a debug echo into your mapper: in /autonfs/home/wilde/swift/lab/rwmapper.2012.0914/clusMapper_name.sh -name name_here -line 16 Your mapper (not surprisingly) rejects the -line 16 argument and fails. That said, Swift needs to do a better job in handling and reporting failures in external mappers. - Mike ----- Original Message ----- > From: "Michael Wilde" > To: "Robin Weiss" > Cc: swift-user at ci.uchicago.edu > Sent: Friday, September 14, 2012 1:32:44 PM > Subject: Re: [Swift-user] Custom/External Mappers > Ooops, my mistake, sorry. > > You are correct, the ext mapper args are passed as "-argname value". 
> The user guide is correct and that did not likely change in trunk. > > So the problem is elsewhere. I will investigate further. > > - Mike > > > ----- Original Message ----- > > From: "Robin Weiss" > > To: "Michael Wilde" , swift-user at ci.uchicago.edu > > Sent: Friday, September 14, 2012 1:26:39 PM > > Subject: Re: [Swift-user] Custom/External Mappers > > Interesting. I wrote the custom mapper script expecting swift to > > call > > it > > with a whitespace (not equal sign) between the parameter switch and > > the > > value that goes with it (I.e.: ./clusMapper_name.sh -name > > name_goes_here). > > This is how the example in the swift users guide for 0.93 > > (http://www.ci.uchicago.edu/swift/guides/release-0.93/userguide/userguide.h > > tml#_external_mapper) does it. The example from the users guide will > > also > > chokes if you put an equal sign instead of whitespace between the > > param > > name switch and the value. > > > > This issue popped up after we updated to the latest trunk version > > from > > 0.93. Did the way swift calls external mappers change (I.e. Equal > > sign > > separated instead of whitespace between param name and value)? > > > > Robin > > > > > > -- > > Robin M. Weiss > > Research Programmer > > Research Computing Center > > The University of Chicago > > 6030 S. Ellis Ave., Suite 289C > > Chicago, IL 60637 > > robinweiss at uchicago.edu > > 773.702.9030 > > > > > > > > > > > > > > On 9/14/12 1:14 PM, "Michael Wilde" wrote: > > > > >Robin, I looked into this problem a bit. It seems that your > > >external > > >mapper scripts are not returning anything on standard output (and > > >an > > >error message on stderr) for the arguments that you are passing in > > >the > > >failing cases. > > > > > >Here's what my tests show: > > > > > >$ ./clusMapper_name.sh > > >[0] out/.0.out > > >[1] out/.1.out > > >[2] out/.2.out > > >[3] out/.3.out > > >$ > > > > > >$ ./clusMapper_name.sh -name=name_here > > >./clusMapper_name.sh: bad mapper args > > >$ > > > > > >$ ./clusMapper_loop.sh -nfiles=3 > > >./clusMapper_loop.sh: bad mapper args > > >$ > > > > > >I dont see exactly why Swift is hanging in the cases where the > > >mappers > > >return nothing. I would have thought a nicer behavior would be to > > >(a) > > >exit complaining that the external mapper didnt return with exit > > >code > > >0, > > >or (b) proceed as if the arrays were not mapped, and use the > > >default > > >(concurrent) mapper. > > > > > >I suspect whats happening is that Swift is not closing the array > > >when > > >the > > >mapper fails, or some similar synchronization error. We'll need to > > >look > > >into this. I've filed this as bugzilla bug # 829. > > > > > >- Mike > > > > > >----- Original Message ----- > > >> From: "Robin Weiss" > > >> To: swift-user at ci.uchicago.edu > > >> Sent: Monday, September 10, 2012 4:56:45 PM > > >> Subject: [Swift-user] Custom/External Mappers > > >> Howdy swifters, > > >> > > >> > > >> I am having some issues getting external mappers to work. In > > >> particular, swift appears to hang when you use a custom mapper > > >> that > > >> takes in one or more command line arguments. > > >> > > >> > > >> Attached is a tar ball with a basic example of the issue. You can > > >> uncomment each line in the mapper.swift script to see the > > >> behavior. > > >> I > > >> have included 4 versions of my external mapper, two work and two > > >> do > > >> not (see mapper.swift). Also included are the config, sites > > >> (using > > >> localhost), and tc files I'm using. 
> > >> > > >> > > >> I noticed this issue after moving from version 0.93 to trunk > > >> (Swift > > >> trunk swift-r5917 cog-r3463) when some scripts I had been using > > >> started to hang. > > >> > > >> > > >> Thanks in advance, > > >> Robin > > >> > > >> > > >> > > >> > > >> -- > > >> > > >> Robin M. Weiss > > >> Research Programmer > > >> Research Computing Center > > >> The University of Chicago > > >> 6030 S. Ellis Ave., Suite 289C > > >> Chicago, IL 60637 > > >> robinweiss at uchicago.edu > > >> 773.702.9030 > > >> _______________________________________________ > > >> Swift-user mailing list > > >> Swift-user at ci.uchicago.edu > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > >-- > > >Michael Wilde > > >Computation Institute, University of Chicago > > >Mathematics and Computer Science Division > > >Argonne National Laboratory > > > > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Fri Sep 14 13:58:58 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Fri, 14 Sep 2012 13:58:58 -0500 (CDT) Subject: [Swift-user] Custom/External Mappers In-Reply-To: <1258978443.17147.1347648222494.JavaMail.root@zimbra.anl.gov> Message-ID: <1491733435.17196.1347649138298.JavaMail.root@zimbra.anl.gov> Robin, adding this line to your mapper to handle the unexpected -line argument makes the example you sent work now: -line) ;; I was unable to find where this change got into trunk, but I will ask on the swift-devel list (and in bug 829) to back off the change. - Mike ----- Original Message ----- > From: "Michael Wilde" > To: "Robin Weiss" > Cc: swift-user at ci.uchicago.edu > Sent: Friday, September 14, 2012 1:43:42 PM > Subject: Re: [Swift-user] Custom/External Mappers > OK, I'm revising my diagnosis here. It looks like a change was > introduced into trunk which is adding an extra argument to the ext > mapper invocation giving the line number in the .swift script from > which it was called. This looks to me like some developer debugging > that got committed by mistake. We will post to this list when its > removed. You can see the problem by inserting a debug echo into your > mapper: > > in /autonfs/home/wilde/swift/lab/rwmapper.2012.0914/clusMapper_name.sh > -name name_here -line 16 > > Your mapper (not surprisingly) rejects the -line 16 argument and > fails. > > That said, Swift needs to do a better job in handling and reporting > failures in external mappers. > > - Mike > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "Robin Weiss" > > Cc: swift-user at ci.uchicago.edu > > Sent: Friday, September 14, 2012 1:32:44 PM > > Subject: Re: [Swift-user] Custom/External Mappers > > Ooops, my mistake, sorry. > > > > You are correct, the ext mapper args are passed as "-argname value". > > The user guide is correct and that did not likely change in trunk. > > > > So the problem is elsewhere. I will investigate further. 
> > > > - Mike > > > > > > ----- Original Message ----- > > > From: "Robin Weiss" > > > To: "Michael Wilde" , > > > swift-user at ci.uchicago.edu > > > Sent: Friday, September 14, 2012 1:26:39 PM > > > Subject: Re: [Swift-user] Custom/External Mappers > > > Interesting. I wrote the custom mapper script expecting swift to > > > call > > > it > > > with a whitespace (not equal sign) between the parameter switch > > > and > > > the > > > value that goes with it (I.e.: ./clusMapper_name.sh -name > > > name_goes_here). > > > This is how the example in the swift users guide for 0.93 > > > (http://www.ci.uchicago.edu/swift/guides/release-0.93/userguide/userguide.h > > > tml#_external_mapper) does it. The example from the users guide > > > will > > > also > > > chokes if you put an equal sign instead of whitespace between the > > > param > > > name switch and the value. > > > > > > This issue popped up after we updated to the latest trunk version > > > from > > > 0.93. Did the way swift calls external mappers change (I.e. Equal > > > sign > > > separated instead of whitespace between param name and value)? > > > > > > Robin > > > > > > > > > -- > > > Robin M. Weiss > > > Research Programmer > > > Research Computing Center > > > The University of Chicago > > > 6030 S. Ellis Ave., Suite 289C > > > Chicago, IL 60637 > > > robinweiss at uchicago.edu > > > 773.702.9030 > > > > > > > > > > > > > > > > > > > > > On 9/14/12 1:14 PM, "Michael Wilde" wrote: > > > > > > >Robin, I looked into this problem a bit. It seems that your > > > >external > > > >mapper scripts are not returning anything on standard output (and > > > >an > > > >error message on stderr) for the arguments that you are passing > > > >in > > > >the > > > >failing cases. > > > > > > > >Here's what my tests show: > > > > > > > >$ ./clusMapper_name.sh > > > >[0] out/.0.out > > > >[1] out/.1.out > > > >[2] out/.2.out > > > >[3] out/.3.out > > > >$ > > > > > > > >$ ./clusMapper_name.sh -name=name_here > > > >./clusMapper_name.sh: bad mapper args > > > >$ > > > > > > > >$ ./clusMapper_loop.sh -nfiles=3 > > > >./clusMapper_loop.sh: bad mapper args > > > >$ > > > > > > > >I dont see exactly why Swift is hanging in the cases where the > > > >mappers > > > >return nothing. I would have thought a nicer behavior would be to > > > >(a) > > > >exit complaining that the external mapper didnt return with exit > > > >code > > > >0, > > > >or (b) proceed as if the arrays were not mapped, and use the > > > >default > > > >(concurrent) mapper. > > > > > > > >I suspect whats happening is that Swift is not closing the array > > > >when > > > >the > > > >mapper fails, or some similar synchronization error. We'll need > > > >to > > > >look > > > >into this. I've filed this as bugzilla bug # 829. > > > > > > > >- Mike > > > > > > > >----- Original Message ----- > > > >> From: "Robin Weiss" > > > >> To: swift-user at ci.uchicago.edu > > > >> Sent: Monday, September 10, 2012 4:56:45 PM > > > >> Subject: [Swift-user] Custom/External Mappers > > > >> Howdy swifters, > > > >> > > > >> > > > >> I am having some issues getting external mappers to work. In > > > >> particular, swift appears to hang when you use a custom mapper > > > >> that > > > >> takes in one or more command line arguments. > > > >> > > > >> > > > >> Attached is a tar ball with a basic example of the issue. You > > > >> can > > > >> uncomment each line in the mapper.swift script to see the > > > >> behavior. 
> > > >> I > > > >> have included 4 versions of my external mapper, two work and > > > >> two > > > >> do > > > >> not (see mapper.swift). Also included are the config, sites > > > >> (using > > > >> localhost), and tc files I'm using. > > > >> > > > >> > > > >> I noticed this issue after moving from version 0.93 to trunk > > > >> (Swift > > > >> trunk swift-r5917 cog-r3463) when some scripts I had been using > > > >> started to hang. > > > >> > > > >> > > > >> Thanks in advance, > > > >> Robin > > > >> > > > >> > > > >> > > > >> > > > >> -- > > > >> > > > >> Robin M. Weiss > > > >> Research Programmer > > > >> Research Computing Center > > > >> The University of Chicago > > > >> 6030 S. Ellis Ave., Suite 289C > > > >> Chicago, IL 60637 > > > >> robinweiss at uchicago.edu > > > >> 773.702.9030 > > > >> _______________________________________________ > > > >> Swift-user mailing list > > > >> Swift-user at ci.uchicago.edu > > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > >-- > > > >Michael Wilde > > > >Computation Institute, University of Chicago > > > >Mathematics and Computer Science Division > > > >Argonne National Laboratory > > > > > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From iraicu at cs.iit.edu Fri Sep 14 16:33:06 2012 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Fri, 14 Sep 2012 16:33:06 -0500 Subject: [Swift-user] HPDC 2013 Call for Workshops Message-ID: <5053A292.70300@cs.iit.edu> HPDC 2013 Call for Workshops http://www.hpdc.org/2013/workshops/call-for-workshops/ The organizers of the 22nd International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC'13) call for proposals for workshops to be held with HPDC'13. The workshops will be held on June 17-18, 2013. Workshops should provide forums for discussion among researchers and practitioners on focused topics or emerging research areas relevant to the HPDC community. Organizers may structure workshops as they see fit, including invited talks, panel discussions, presentations of work in progress, fully peer-reviewed papers, or some combination. Workshops could be scheduled for half a day or a full day, depending on interest, space constraints, and organizer preference. Organizers should design workshops for approximately 20-40 participants, to balance impact and effective discussion. 
Workshop proposals must be sent in PDF format to the HPDC'13 Workshops Chair, Abhishek Chandra (Email: chandra AT cs DOT umn DOT edu) with the subject line "HPDC 2013 Workshop Proposal", and should include:

- The name and acronym of the workshop
- A description (0.5-1 page) of the theme of the workshop
- A description (one paragraph) of the relation between the theme of the workshop and of HPDC
- A list of topics of interest
- The names and affiliations of the workshop organizers, and if applicable, of a significant portion of the program committee
- A description of the expected structure of the workshop (papers, invited talks, panel discussions, etc.)
- Data about previous offerings of the workshop (if any), including the attendance, the numbers of papers or presentations submitted and accepted, and the links to the corresponding websites
- A publicity plan for attracting submissions and attendees. Please also include the expected number of submissions, accepted papers, and attendees that you anticipate for a successful workshop.

Due to publication deadlines, workshops must operate within roughly the following timeline: papers due mid-February (2-3 weeks after the HPDC deadline), and selected and sent to the publisher by mid-April.

Important dates:
Workshop Proposals Due: October 25, 2012
Notifications: November 2, 2012
Workshop CFPs Online and Distributed: November 25, 2012

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cell: 1-847-722-0876 Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
=================================================================

From iraicu at cs.iit.edu Fri Sep 14 16:58:32 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Fri, 14 Sep 2012 16:58:32 -0500
Subject: [Swift-user] CFP: 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) 2013
Message-ID: <5053A888.6010001@cs.iit.edu>

**** CALL FOR PAPERS ****

The 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013)
Delft University of Technology, Delft, the Netherlands
May 13-16, 2013
http://www.pds.ewi.tudelft.nl/ccgrid2013

Rapid advances in architectures, networks, and systems and middleware technologies are leading to new concepts in and platforms for computing, ranging from Clusters and Grids to Clouds and Datacenters. CCGrid is a series of very successful conferences, sponsored by the IEEE Computer Society Technical Committee on Scalable Computing (TCSC) and the ACM, with the overarching goal of bringing together international researchers, developers, and users to provide an international forum to present leading research activities and results on a broad range of topics related to these concepts and platforms, and their applications. The conference features keynotes, technical presentations, workshops, tutorials, and posters, as well as the SCALE challenge featuring live demonstrations.
In 2013, CCGrid will come to the Netherlands for the first time, and will be held in Delft, a historical, picturesque city that is less than one hour away from Amsterdam-Schiphol airport. The main conference will be held on May 14-16 (Tuesday to Thursday), with tutorials and affiliated workshops taking place on May 13 (Monday). **** IMPORTANT DATES **** Papers Due: 12 November 2012 Author Notifications: 24 January 2013 Final Papers Due: 22 February 2013 **** TOPICS OF INTEREST **** CCGrid 2013 will have a focus on important and immediate issues that are significantly influencing all aspects of cluster, cloud and grid computing. Topics of interest include, but are not limited to: * Applications and Experiences: Applications to real and complex problems in science, engineering, business, and society; User studies; Experiences with large-scale deployments, systems, or applications * Architecture: System architectures, design and deployment; Power and cooling; Security and reliability; High availability solutions * Autonomic Computing and Cyberinfrastructure: Self-managed behavior, models and technologies; Autonomic paradigms and systems (control-based, bio-inspired, emergent, etc.); Bio-inspired optimizations and computing * Cloud Computing: Cloud architectures; Software tools and techniques for clouds * Multicore and Accelerator-based Computing: Software and application techniques to utilize multicore architectures and accelerators in clusters, grids, and clouds * Performance Modeling and Evaluation: Performance prediction and modeling; Monitoring and evaluation tools; Analysis of system and application performance; Benchmarks and testbeds * Programming Models, Systems, and Fault-Tolerant Computing: Programming models and environments for cluster, cloud, and grid computing; Fault-tolerant systems, programs and algorithms; Systems software to support efficient computing * Scheduling and Resource Management: Techniques to schedule jobs and resources on cluster, cloud, and grid computing platforms; SLA definition and enforcement **** PAPER SUBMISSION GUIDELINES **** Authors are invited to submit papers electronically in PDF format. Submitted manuscripts should be structured as technical papers and may not exceed 8 letter-size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings. Submissions not conforming to these guidelines may be returned without review. Authors should make sure that their file will print on a printer that uses letter-size (8.5 x 11) paper. The official language of the conference is English. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding the page limit, or not appropriately structured may not be considered. Authors may contact the conference chairs for more information. The proceedings will be published through the IEEE Computer Society Press, USA, and will be made available online through the IEEE Digital Library. 
**** CALL FOR TUTORIAL AND WORKSHOP PROPOSALS ****

Tutorials and workshops affiliated with CCGrid 2013 will be held on May 13 (Monday). For more information on the tutorials and workshops and for the complete Call for Tutorial and Workshop Proposals, please see the conference website.

**** GENERAL CHAIR ****
Dick Epema, Delft University of Technology, the Netherlands

**** PROGRAM CHAIR ****
Thomas Fahringer, University of Innsbruck, Austria

**** PROGRAM VICE-CHAIRS ****
Rosa Badia, Barcelona Supercomputing Center, Spain
Henri Bal, Vrije Universiteit, the Netherlands
Marios Dikaiakos, University of Cyprus, Cyprus
Kirk Cameron, Virginia Tech, USA
Daniel Katz, University of Chicago & Argonne Nat Lab, USA
Kate Keahey, Argonne National Laboratory, USA
Martin Schulz, Lawrence Livermore National Laboratory, USA
Douglas Thain, University of Notre Dame, USA
Cheng-Zhong Xu, Shenzhen Inst. of Advanced Technology, China

**** WORKSHOPS CO-CHAIRS ****
Shantenu Jha, Rutgers and Louisiana State University, USA
Ioan Raicu, Illinois Institute of Technology, USA

**** TUTORIALS CHAIR ****
Radu Prodan, University of Innsbruck, Austria

**** DOCTORAL SYMPOSIUM CO-CHAIRS ****
Yogesh Simmhan, University of Southern California, USA
Ana Varbanescu, Delft Univ of Technology, the Netherlands

**** SUBMISSIONS AND PROCEEDINGS CHAIR ****
Pavan Balaji, Argonne National Laboratory, USA

**** FINANCE AND REGISTRATION CHAIR ****
Alexandru Iosup, Delft Univ of Technology, the Netherlands

**** PUBLICITY CHAIRS ****
Nazareno Andrade, Universidade Federal de Campina Grande, Brazil
Gabriel Antoniu, INRIA, France
Bahman Javadi, University of Western Sydney, Australia
Ioan Raicu, Illinois Institute of Technology and Argonne National Laboratory, USA
Kin Choong Yow, Shenzhen Inst. of Advanced Technology, China

**** CYBER CHAIR ****
Stephen van der Laan, Delft University of Technology, the Netherlands

--
=================================================================
Ioan Raicu, Ph.D.
Assistant Professor, Illinois Institute of Technology (IIT)
Guest Research Faculty, Argonne National Laboratory (ANL)
=================================================================
Data-Intensive Distributed Systems Laboratory, CS/IIT
Distributed Systems Laboratory, MCS/ANL
=================================================================
Cell: 1-847-722-0876 Office: 1-312-567-5704
Email: iraicu at cs.iit.edu
Web: http://www.cs.iit.edu/~iraicu/
Web: http://datasys.cs.iit.edu/
=================================================================
=================================================================

From iraicu at cs.iit.edu Fri Sep 14 18:59:42 2012
From: iraicu at cs.iit.edu (Ioan Raicu)
Date: Fri, 14 Sep 2012 18:59:42 -0500
Subject: [Swift-user] Call for Workshops: IEEE/ACM CCGrid 2013
Message-ID: <5053C4EE.6070700@cs.iit.edu>

**** CALL FOR WORKSHOPS ****

The 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2013)
Delft University of Technology, Delft, the Netherlands
May 13-16, 2013
http://www.pds.ewi.tudelft.nl/ccgrid2013
http://www.pds.ewi.tudelft.nl/ccgrid2013/calls/workshops/

Workshop proposals are invited for CCGrid 2013 on specific aspects of Grid, Cloud and Cluster Computing, particularly relating to the subject areas indicated by the topics below. We encourage workshops that will discuss fundamental research issues driven by academic interests or more applied industrial or commercial concerns.
The format of the workshop will be determined by the organizers. Workshops can vary in length from a half day to a full day. Having more than one co-organizer for a workshop is strongly advised. Workshop proceedings will be published as part of the CCGrid 2013 proceedings. So, it is very important that high quality workshops are accepted, and that workshop chairs observe strict quality standards, no more than 50% acceptance, in the selection of papers for their events. Workshop attendees are required to register for the main conference. IMPORTANT DATES AND PROPOSAL SUBMISSION: --------------------------------------------------------------------------- Workshop proposals and any enquiries should be sent by e-mail to the workshop chairs. Proposals should be submitted in PDF format (printable on A4 paper). Workshop Proposals Due: October 20, 2012 Notification: November 3, 2012 Workshops: May 13-16 2013 PROPOSAL REQUIREMENTS: ------------------------------------------- Proposals for workshops should be between 2 to 5 pages in length. They should contain the following information: o Title and brief technical description of the workshop, specifying the goals and the technical issues that will be its focus. o A brief abstract of the workshop (less than 200 words), intended for the CCGrid 2013 web site. o A brief description of why and to whom the workshop is of interest. o A list of related workshops or similar events held in the last 3 years or to be held in 2012/2013. o The names and contact information (web page, email address) of the proposed organizing committee. This committee should consist of two or three people knowledgeable about the technical issues to be addressed, preferably not members of the same institution. A brief description of the qualifications of the proposed organizing committee with respect to organizing this workshop (e.g., papers published in the proposed topic area, previous workshop organization, other relevant information). o Link to a preliminary web site of the workshop and a preliminary call for papers o A list of committed and proposed PC members. RESPONSIBILITIES: ------------------------------ Each workshop organizing committee will be responsible for the following: o Producing a web page and a "Call for Papers/Participation" for their workshop. The URL should be sent to the workshop co-chairs for CCGrid 2013. o The call must make it clear that the workshop is open to all members of the Distributed Computing community. It should mention that at least one author of each accepted submission must attend the workshop and that all workshop participants must pay the conference fee. o Finally, it should also clearly describe the process by which the Organizing Committee will select the participants. o Ensure that all workshop papers are a maximum of 6 pages in length (in IEEE format). It is the responsibility of the workshop organizers to ensure that this page limit has been adhered to. Additional pages may be purchased (in some circumstances) subject to approval of the proceedings chair. o Provide a brief description of the workshop for the conference web page and program. o Selecting the participants and the format of the workshop. The publication of proceedings will be by the IEEE in the same volume as the main conference. Camera-ready due date for the accepted workshop papers will be the same as the main conference. Therefore, workshop organizers should set the acceptance notification date at least 2 weeks earlier than the camera-ready due date. 
All other details can be up to workshop organizers to set, such as advertising the workshop beyond the conference web page and assistance in producing a camera-ready version of the workshop proceedings. Important Note: ----------------------- Workshop organizers must ensure that suitable quality measures have been taken to ensure that the accepted papers are of high quality. All papers must be reviewed by an International Programme Committee (with a minimum of 3 reviews per paper). The CCGrid 2013 Organizing Committee will be responsible for the following: o Providing a link to a workshop's local page. o Providing logistics support and a meeting place for the workshop. o In conjunction with the organizers, determining the workshop date and time. o Providing copies of the workshop proceedings to attendees. Important Note: ----------------------- The CCGrid 2013 Organizing Committee may decide, if the workshop is too small (i.e. does not attract enough submissions) to merge it with another workshop. So we encourage workshop organizers to attract a large community. In extreme situations we may also cancel workshops if there are not enough submissions. Conference Topics of Interest and Area Keywords: ------------------------------------------------------------------------- Topics of interest to the conference include (but are not restricted to): o Autonomic Grid Computing o Content Distribution Networks o Cloud Computing o Cluster Computing o Grid Computing o Peer-2-Peer Computing o Multi-Core Systems o Grid and Cloud Testbeds o Semantic Grids o Web 2.0 Technologies o Workflow Tools and Applications o Programming Models o Energy Management in Data Centers o Resource Management o Service Level Agreements and Scheduling o Tools and Environments o Scientific Instruments and Grid Computing o Application areas: HealthCare/Life Sciences, Engineering, etc. WORKSHOP CHAIRS: -------------------------------- Please send your proposals to both workshop chairs. o Ioan Raicu (iraicu at cs.iit.edu), Illinois Institute of Technology, USA o Shantenu Jha (shantenu.jha at rutgers.edu), Rutgers University, USA -- ================================================================= Ioan Raicu, Ph.D. Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ ================================================================= ================================================================= From hategan at mcs.anl.gov Sat Sep 15 15:51:26 2012 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 15 Sep 2012 13:51:26 -0700 Subject: [Swift-user] GT4 Message-ID: <1347742286.14827.1.camel@blabla> Hi, Is anybody still using the GT4/WS-GRAM provider? I would like to retire it in trunk, but I wanted to ask first. Mihael From wilde at mcs.anl.gov Sat Sep 15 16:19:29 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 15 Sep 2012 16:19:29 -0500 (CDT) Subject: [Swift-user] GT4 In-Reply-To: <1347742286.14827.1.camel@blabla> Message-ID: <1229025897.18500.1347743969573.JavaMail.root@zimbra.anl.gov> I agree with retiring it. I havent heard of anyone using it in years. 
- Mike ----- Original Message ----- > From: "Mihael Hategan" > To: "swift-user" > Sent: Saturday, September 15, 2012 3:51:26 PM > Subject: [Swift-user] GT4 > Hi, > > Is anybody still using the GT4/WS-GRAM provider? > > I would like to retire it in trunk, but I wanted to ask first. > > Mihael > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From wilde at mcs.anl.gov Sat Sep 15 16:59:14 2012 From: wilde at mcs.anl.gov (Michael Wilde) Date: Sat, 15 Sep 2012 16:59:14 -0500 (CDT) Subject: [Swift-user] Custom/External Mappers In-Reply-To: <1491733435.17196.1347649138298.JavaMail.root@zimbra.anl.gov> Message-ID: <663648207.18552.1347746354941.JavaMail.root@zimbra.anl.gov> This should be fixed in trunk now and thus -line wont be added any more. - Mike ----- Original Message ----- > From: "Michael Wilde" > To: "Robin Weiss" > Cc: swift-user at ci.uchicago.edu > Sent: Friday, September 14, 2012 1:58:58 PM > Subject: Re: [Swift-user] Custom/External Mappers > Robin, adding this line to your mapper to handle the unexpected -line > argument makes the example you sent work now: > > -line) ;; > > I was unable to find where this change got into trunk, but I will ask > on the swift-devel list (and in bug 829) to back off the change. > > - Mike > > > ----- Original Message ----- > > From: "Michael Wilde" > > To: "Robin Weiss" > > Cc: swift-user at ci.uchicago.edu > > Sent: Friday, September 14, 2012 1:43:42 PM > > Subject: Re: [Swift-user] Custom/External Mappers > > OK, I'm revising my diagnosis here. It looks like a change was > > introduced into trunk which is adding an extra argument to the ext > > mapper invocation giving the line number in the .swift script from > > which it was called. This looks to me like some developer debugging > > that got committed by mistake. We will post to this list when its > > removed. You can see the problem by inserting a debug echo into your > > mapper: > > > > in > > /autonfs/home/wilde/swift/lab/rwmapper.2012.0914/clusMapper_name.sh > > -name name_here -line 16 > > > > Your mapper (not surprisingly) rejects the -line 16 argument and > > fails. > > > > That said, Swift needs to do a better job in handling and reporting > > failures in external mappers. > > > > - Mike > > > > ----- Original Message ----- > > > From: "Michael Wilde" > > > To: "Robin Weiss" > > > Cc: swift-user at ci.uchicago.edu > > > Sent: Friday, September 14, 2012 1:32:44 PM > > > Subject: Re: [Swift-user] Custom/External Mappers > > > Ooops, my mistake, sorry. > > > > > > You are correct, the ext mapper args are passed as "-argname > > > value". > > > The user guide is correct and that did not likely change in trunk. > > > > > > So the problem is elsewhere. I will investigate further. > > > > > > - Mike > > > > > > > > > ----- Original Message ----- > > > > From: "Robin Weiss" > > > > To: "Michael Wilde" , > > > > swift-user at ci.uchicago.edu > > > > Sent: Friday, September 14, 2012 1:26:39 PM > > > > Subject: Re: [Swift-user] Custom/External Mappers > > > > Interesting. I wrote the custom mapper script expecting swift to > > > > call > > > > it > > > > with a whitespace (not equal sign) between the parameter switch > > > > and > > > > the > > > > value that goes with it (I.e.: ./clusMapper_name.sh -name > > > > name_goes_here). 
> > > > This is how the example in the swift users guide for 0.93 > > > > (http://www.ci.uchicago.edu/swift/guides/release-0.93/userguide/userguide.h > > > > tml#_external_mapper) does it. The example from the users guide > > > > will > > > > also > > > > chokes if you put an equal sign instead of whitespace between > > > > the > > > > param > > > > name switch and the value. > > > > > > > > This issue popped up after we updated to the latest trunk > > > > version > > > > from > > > > 0.93. Did the way swift calls external mappers change (I.e. > > > > Equal > > > > sign > > > > separated instead of whitespace between param name and value)? > > > > > > > > Robin > > > > > > > > > > > > -- > > > > Robin M. Weiss > > > > Research Programmer > > > > Research Computing Center > > > > The University of Chicago > > > > 6030 S. Ellis Ave., Suite 289C > > > > Chicago, IL 60637 > > > > robinweiss at uchicago.edu > > > > 773.702.9030 > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 9/14/12 1:14 PM, "Michael Wilde" wrote: > > > > > > > > >Robin, I looked into this problem a bit. It seems that your > > > > >external > > > > >mapper scripts are not returning anything on standard output > > > > >(and > > > > >an > > > > >error message on stderr) for the arguments that you are passing > > > > >in > > > > >the > > > > >failing cases. > > > > > > > > > >Here's what my tests show: > > > > > > > > > >$ ./clusMapper_name.sh > > > > >[0] out/.0.out > > > > >[1] out/.1.out > > > > >[2] out/.2.out > > > > >[3] out/.3.out > > > > >$ > > > > > > > > > >$ ./clusMapper_name.sh -name=name_here > > > > >./clusMapper_name.sh: bad mapper args > > > > >$ > > > > > > > > > >$ ./clusMapper_loop.sh -nfiles=3 > > > > >./clusMapper_loop.sh: bad mapper args > > > > >$ > > > > > > > > > >I dont see exactly why Swift is hanging in the cases where the > > > > >mappers > > > > >return nothing. I would have thought a nicer behavior would be > > > > >to > > > > >(a) > > > > >exit complaining that the external mapper didnt return with > > > > >exit > > > > >code > > > > >0, > > > > >or (b) proceed as if the arrays were not mapped, and use the > > > > >default > > > > >(concurrent) mapper. > > > > > > > > > >I suspect whats happening is that Swift is not closing the > > > > >array > > > > >when > > > > >the > > > > >mapper fails, or some similar synchronization error. We'll need > > > > >to > > > > >look > > > > >into this. I've filed this as bugzilla bug # 829. > > > > > > > > > >- Mike > > > > > > > > > >----- Original Message ----- > > > > >> From: "Robin Weiss" > > > > >> To: swift-user at ci.uchicago.edu > > > > >> Sent: Monday, September 10, 2012 4:56:45 PM > > > > >> Subject: [Swift-user] Custom/External Mappers > > > > >> Howdy swifters, > > > > >> > > > > >> > > > > >> I am having some issues getting external mappers to work. In > > > > >> particular, swift appears to hang when you use a custom > > > > >> mapper > > > > >> that > > > > >> takes in one or more command line arguments. > > > > >> > > > > >> > > > > >> Attached is a tar ball with a basic example of the issue. You > > > > >> can > > > > >> uncomment each line in the mapper.swift script to see the > > > > >> behavior. > > > > >> I > > > > >> have included 4 versions of my external mapper, two work and > > > > >> two > > > > >> do > > > > >> not (see mapper.swift). Also included are the config, sites > > > > >> (using > > > > >> localhost), and tc files I'm using. 
> > > > >> > > > > >> > > > > >> I noticed this issue after moving from version 0.93 to trunk > > > > >> (Swift > > > > >> trunk swift-r5917 cog-r3463) when some scripts I had been > > > > >> using > > > > >> started to hang. > > > > >> > > > > >> > > > > >> Thanks in advance, > > > > >> Robin > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> -- > > > > >> > > > > >> Robin M. Weiss > > > > >> Research Programmer > > > > >> Research Computing Center > > > > >> The University of Chicago > > > > >> 6030 S. Ellis Ave., Suite 289C > > > > >> Chicago, IL 60637 > > > > >> robinweiss at uchicago.edu > > > > >> 773.702.9030 > > > > >> _______________________________________________ > > > > >> Swift-user mailing list > > > > >> Swift-user at ci.uchicago.edu > > > > >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > > > > > > >-- > > > > >Michael Wilde > > > > >Computation Institute, University of Chicago > > > > >Mathematics and Computer Science Division > > > > >Argonne National Laboratory > > > > > > > > > > > -- > > > Michael Wilde > > > Computation Institute, University of Chicago > > > Mathematics and Computer Science Division > > > Argonne National Laboratory > > > > > > _______________________________________________ > > > Swift-user mailing list > > > Swift-user at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > > > -- > > Michael Wilde > > Computation Institute, University of Chicago > > Mathematics and Computer Science Division > > Argonne National Laboratory > > > > _______________________________________________ > > Swift-user mailing list > > Swift-user at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user > > -- > Michael Wilde > Computation Institute, University of Chicago > Mathematics and Computer Science Division > Argonne National Laboratory > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user -- Michael Wilde Computation Institute, University of Chicago Mathematics and Computer Science Division Argonne National Laboratory From marco at hep.uchicago.edu Sat Sep 15 20:06:35 2012 From: marco at hep.uchicago.edu (Marco Mambelli) Date: Sat, 15 Sep 2012 20:06:35 -0500 (CDT) Subject: [Swift-user] GT4 In-Reply-To: <1347742286.14827.1.camel@blabla> References: <1347742286.14827.1.camel@blabla> Message-ID: OSG 1.2 sites have all GT4, and they can enable WS-GRAM if they like but I think noone does. Marco On Sat, 15 Sep 2012, Mihael Hategan wrote: > Hi, > > Is anybody still using the GT4/WS-GRAM provider? > > I would like to retire it in trunk, but I wanted to ask first. > > Mihael > > _______________________________________________ > Swift-user mailing list > Swift-user at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user >