From ketan at mcs.anl.gov Tue Sep 2 15:01:52 2014 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Tue, 2 Sep 2014 15:01:52 -0500 Subject: [Swift-devel] error building swift trunk on bluegene In-Reply-To: <1408831365.24566.1.camel@echo> References: <1408831365.24566.1.camel@echo> Message-ID: The error seems to be persisting on bluegene with the latest trunk. On Sat, Aug 23, 2014 at 5:02 PM, Mihael Hategan wrote: > That particular issue should now be fixed in trunk. I'm able to build > swift with the ibm libraries (although within eclipse, so please give > this a shot in a real environment). > > Mihael > > On Fri, 2014-08-22 at 13:44 -0500, Ketan Maheshwari wrote: > > PS. The current java is from IBM: > > > > $ java -version > > java version "1.6.0" > > Java(TM) SE Runtime Environment (build pxp6460sr15-20131017_01(SR15)) > > IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux ppc64-64 > > jvmxp6460sr15-20131016_170922 (JIT enabled, AOT enabled) > > J9VM - 20131016_170922 > > JIT - r9_20130920_46510ifx2 > > GC - GA24_Java6_SR15_20131016_1337_B170922) > > JCL - 20131015_01 > > > > > > On Fri, Aug 22, 2014 at 1:41 PM, Ketan Maheshwari > wrote: > > > > > Hi, > > > > > > I am getting this error trying to build Swift trunk on BG Vesta: > > > > > > compile: > > > [echo] [util]: COMPILE > > > [mkdir] Created dir: > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > [javac] Compiling 56 source files to > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > [javac] > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > cannot find symbol > > > [javac] symbol : class VMManagement > > > [javac] location: package sun.management > > > [javac] sun.management.VMManagement mgmt = > > > (sun.management.VMManagement) jvm.get(runtime); > > > [javac] ^ > > > [javac] > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > cannot find symbol > > > [javac] symbol : class VMManagement > > > [javac] location: package sun.management > > > [javac] sun.management.VMManagement mgmt = > > > (sun.management.VMManagement) jvm.get(runtime); > > > [javac] ^ > > > [javac] Note: Some input files use or override a deprecated API. > > > [javac] Note: Recompile with -Xlint:deprecation for details. > > > [javac] Note: Some input files use unchecked or unsafe operations. > > > [javac] Note: Recompile with -Xlint:unchecked for details. > > > [javac] 2 errors > > > > > > Doe this means I need sun java on Vesta? Does anyone know if Sun java > is > > > available for PPC64 architecture? > > > > > > Thanks, > > > Ketan > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Sep 2 17:11:27 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 2 Sep 2014 15:11:27 -0700 Subject: [Swift-devel] error building swift trunk on bluegene In-Reply-To: References: <1408831365.24566.1.camel@echo> Message-ID: <1409695887.16233.14.camel@echo> Unless we have different definitions of "latest trunk", I don't think that's possible. 
See https://github.com/swift-lang/swift-k/blob/master/cogkit/modules/util/src/org/globus/cog/util/concurrent/FileLock.java Line 91 is this: --------------- 91: } --------------- Mihael On Tue, 2014-09-02 at 15:01 -0500, Ketan Maheshwari wrote: > The error seems to be persisting on bluegene with the latest trunk. > > > On Sat, Aug 23, 2014 at 5:02 PM, Mihael Hategan wrote: > > > That particular issue should now be fixed in trunk. I'm able to build > > swift with the ibm libraries (although within eclipse, so please give > > this a shot in a real environment). > > > > Mihael > > > > On Fri, 2014-08-22 at 13:44 -0500, Ketan Maheshwari wrote: > > > PS. The current java is from IBM: > > > > > > $ java -version > > > java version "1.6.0" > > > Java(TM) SE Runtime Environment (build pxp6460sr15-20131017_01(SR15)) > > > IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux ppc64-64 > > > jvmxp6460sr15-20131016_170922 (JIT enabled, AOT enabled) > > > J9VM - 20131016_170922 > > > JIT - r9_20130920_46510ifx2 > > > GC - GA24_Java6_SR15_20131016_1337_B170922) > > > JCL - 20131015_01 > > > > > > > > > On Fri, Aug 22, 2014 at 1:41 PM, Ketan Maheshwari > > wrote: > > > > > > > Hi, > > > > > > > > I am getting this error trying to build Swift trunk on BG Vesta: > > > > > > > > compile: > > > > [echo] [util]: COMPILE > > > > [mkdir] Created dir: > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > > [javac] Compiling 56 source files to > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > > [javac] > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > > cannot find symbol > > > > [javac] symbol : class VMManagement > > > > [javac] location: package sun.management > > > > [javac] sun.management.VMManagement mgmt = > > > > (sun.management.VMManagement) jvm.get(runtime); > > > > [javac] ^ > > > > [javac] > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > > cannot find symbol > > > > [javac] symbol : class VMManagement > > > > [javac] location: package sun.management > > > > [javac] sun.management.VMManagement mgmt = > > > > (sun.management.VMManagement) jvm.get(runtime); > > > > [javac] ^ > > > > [javac] Note: Some input files use or override a deprecated API. > > > > [javac] Note: Recompile with -Xlint:deprecation for details. > > > > [javac] Note: Some input files use unchecked or unsafe operations. > > > > [javac] Note: Recompile with -Xlint:unchecked for details. > > > > [javac] 2 errors > > > > > > > > Doe this means I need sun java on Vesta? Does anyone know if Sun java > > is > > > > available for PPC64 architecture? > > > > > > > > Thanks, > > > > Ketan > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > From ketan at mcs.anl.gov Wed Sep 3 09:42:33 2014 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Wed, 3 Sep 2014 09:42:33 -0500 Subject: [Swift-devel] error building swift trunk on bluegene In-Reply-To: <1409695887.16233.14.camel@echo> References: <1408831365.24566.1.camel@echo> <1409695887.16233.14.camel@echo> Message-ID: I was using svn repo on bluegene. 
Trying with git repo, I am unable to build: $ pwd /home/ketan/swift-k $ ant redist Buildfile: build.xml cleanGenerated: BUILD FAILED /gpfs/vesta-home/ketan/swift-k/build.xml:402: Directory does not exist:/gpfs/vesta-home/ketan/swift-k/src/org/griphyn/vdl/model Total time: 0 seconds On Tue, Sep 2, 2014 at 5:11 PM, Mihael Hategan wrote: > Unless we have different definitions of "latest trunk", I don't think > that's possible. > > See > > https://github.com/swift-lang/swift-k/blob/master/cogkit/modules/util/src/org/globus/cog/util/concurrent/FileLock.java > > Line 91 is this: > --------------- > 91: } > --------------- > > Mihael > > On Tue, 2014-09-02 at 15:01 -0500, Ketan Maheshwari wrote: > > The error seems to be persisting on bluegene with the latest trunk. > > > > > > On Sat, Aug 23, 2014 at 5:02 PM, Mihael Hategan > wrote: > > > > > That particular issue should now be fixed in trunk. I'm able to build > > > swift with the ibm libraries (although within eclipse, so please give > > > this a shot in a real environment). > > > > > > Mihael > > > > > > On Fri, 2014-08-22 at 13:44 -0500, Ketan Maheshwari wrote: > > > > PS. The current java is from IBM: > > > > > > > > $ java -version > > > > java version "1.6.0" > > > > Java(TM) SE Runtime Environment (build pxp6460sr15-20131017_01(SR15)) > > > > IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux ppc64-64 > > > > jvmxp6460sr15-20131016_170922 (JIT enabled, AOT enabled) > > > > J9VM - 20131016_170922 > > > > JIT - r9_20130920_46510ifx2 > > > > GC - GA24_Java6_SR15_20131016_1337_B170922) > > > > JCL - 20131015_01 > > > > > > > > > > > > On Fri, Aug 22, 2014 at 1:41 PM, Ketan Maheshwari > > > > wrote: > > > > > > > > > Hi, > > > > > > > > > > I am getting this error trying to build Swift trunk on BG Vesta: > > > > > > > > > > compile: > > > > > [echo] [util]: COMPILE > > > > > [mkdir] Created dir: > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > > > [javac] Compiling 56 source files to > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > > > [javac] > > > > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > > > cannot find symbol > > > > > [javac] symbol : class VMManagement > > > > > [javac] location: package sun.management > > > > > [javac] sun.management.VMManagement mgmt = > > > > > (sun.management.VMManagement) jvm.get(runtime); > > > > > [javac] ^ > > > > > [javac] > > > > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > > > cannot find symbol > > > > > [javac] symbol : class VMManagement > > > > > [javac] location: package sun.management > > > > > [javac] sun.management.VMManagement mgmt = > > > > > (sun.management.VMManagement) jvm.get(runtime); > > > > > [javac] > ^ > > > > > [javac] Note: Some input files use or override a deprecated > API. > > > > > [javac] Note: Recompile with -Xlint:deprecation for details. > > > > > [javac] Note: Some input files use unchecked or unsafe > operations. > > > > > [javac] Note: Recompile with -Xlint:unchecked for details. > > > > > [javac] 2 errors > > > > > > > > > > Doe this means I need sun java on Vesta? Does anyone know if Sun > java > > > is > > > > > available for PPC64 architecture? 
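
The failing statement quoted above reflects into sun.management.VMManagement, a Sun/Oracle-internal class that IBM's J9 JDK does not ship, which is why the IBM javac cannot resolve the symbol; installing Sun Java for PPC64 is not the only way around it. Below is a minimal sketch of the portable alternative, assuming (as the `jvm.get(runtime)` reflection pattern suggests, though it is not confirmed here) that the code at FileLock.java:91 only needed the current JVM's process id:

---------------
import java.lang.management.ManagementFactory;

public final class JvmPid {
    /**
     * Returns the current JVM's process id using only standard
     * java.lang.management APIs, so it compiles on IBM J9 as well as
     * on Sun/Oracle HotSpot. RuntimeMXBean.getName() conventionally
     * returns "pid@hostname" on both VMs, although that format is not
     * guaranteed by the Java 6 specification, hence the defensive parse.
     */
    public static int getPid() {
        String name = ManagementFactory.getRuntimeMXBean().getName();
        int at = name.indexOf('@');
        if (at <= 0) {
            throw new IllegalStateException("Cannot parse pid from: " + name);
        }
        return Integer.parseInt(name.substring(0, at));
    }

    public static void main(String[] args) {
        System.out.println(getPid());
    }
}
---------------
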
> > > > > > > > > > Thanks, > > > > > Ketan > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Sep 3 13:45:25 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 3 Sep 2014 11:45:25 -0700 Subject: [Swift-devel] error building swift trunk on bluegene In-Reply-To: References: <1408831365.24566.1.camel@echo> <1409695887.16233.14.camel@echo> Message-ID: <1409769925.14892.0.camel@echo> "ant dist" You are not re-building an already compiled swift, you are building one from scratch. Mihael On Wed, 2014-09-03 at 09:42 -0500, Ketan Maheshwari wrote: > I was using svn repo on bluegene. Trying with git repo, I am unable to > build: > > $ pwd > /home/ketan/swift-k > > $ ant redist > Buildfile: build.xml > > cleanGenerated: > > BUILD FAILED > /gpfs/vesta-home/ketan/swift-k/build.xml:402: Directory does not > exist:/gpfs/vesta-home/ketan/swift-k/src/org/griphyn/vdl/model > > Total time: 0 seconds > > > On Tue, Sep 2, 2014 at 5:11 PM, Mihael Hategan wrote: > > > Unless we have different definitions of "latest trunk", I don't think > > that's possible. > > > > See > > > > https://github.com/swift-lang/swift-k/blob/master/cogkit/modules/util/src/org/globus/cog/util/concurrent/FileLock.java > > > > Line 91 is this: > > --------------- > > 91: } > > --------------- > > > > Mihael > > > > On Tue, 2014-09-02 at 15:01 -0500, Ketan Maheshwari wrote: > > > The error seems to be persisting on bluegene with the latest trunk. > > > > > > > > > On Sat, Aug 23, 2014 at 5:02 PM, Mihael Hategan > > wrote: > > > > > > > That particular issue should now be fixed in trunk. I'm able to build > > > > swift with the ibm libraries (although within eclipse, so please give > > > > this a shot in a real environment). > > > > > > > > Mihael > > > > > > > > On Fri, 2014-08-22 at 13:44 -0500, Ketan Maheshwari wrote: > > > > > PS. 
The current java is from IBM: > > > > > > > > > > $ java -version > > > > > java version "1.6.0" > > > > > Java(TM) SE Runtime Environment (build pxp6460sr15-20131017_01(SR15)) > > > > > IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux ppc64-64 > > > > > jvmxp6460sr15-20131016_170922 (JIT enabled, AOT enabled) > > > > > J9VM - 20131016_170922 > > > > > JIT - r9_20130920_46510ifx2 > > > > > GC - GA24_Java6_SR15_20131016_1337_B170922) > > > > > JCL - 20131015_01 > > > > > > > > > > > > > > > On Fri, Aug 22, 2014 at 1:41 PM, Ketan Maheshwari > > > > > > wrote: > > > > > > > > > > > Hi, > > > > > > > > > > > > I am getting this error trying to build Swift trunk on BG Vesta: > > > > > > > > > > > > compile: > > > > > > [echo] [util]: COMPILE > > > > > > [mkdir] Created dir: > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > > > > [javac] Compiling 56 source files to > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > > > > [javac] > > > > > > > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > > > > cannot find symbol > > > > > > [javac] symbol : class VMManagement > > > > > > [javac] location: package sun.management > > > > > > [javac] sun.management.VMManagement mgmt = > > > > > > (sun.management.VMManagement) jvm.get(runtime); > > > > > > [javac] ^ > > > > > > [javac] > > > > > > > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > > > > cannot find symbol > > > > > > [javac] symbol : class VMManagement > > > > > > [javac] location: package sun.management > > > > > > [javac] sun.management.VMManagement mgmt = > > > > > > (sun.management.VMManagement) jvm.get(runtime); > > > > > > [javac] > > ^ > > > > > > [javac] Note: Some input files use or override a deprecated > > API. > > > > > > [javac] Note: Recompile with -Xlint:deprecation for details. > > > > > > [javac] Note: Some input files use unchecked or unsafe > > operations. > > > > > > [javac] Note: Recompile with -Xlint:unchecked for details. > > > > > > [javac] 2 errors > > > > > > > > > > > > Doe this means I need sun java on Vesta? Does anyone know if Sun > > java > > > > is > > > > > > available for PPC64 architecture? > > > > > > > > > > > > Thanks, > > > > > > Ketan > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > From tim.g.armstrong at gmail.com Wed Sep 3 16:49:03 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Wed, 3 Sep 2014 16:49:03 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling Message-ID: I'm running a test Swift/T script that submit tasks to Coasters through the C++ client and I'm seeing some odd behaviour where task submission/execution is stalling for ~2 minute periods. 
For example, I'm seeing submit log messages like "submitting urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of several seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing bursts with the following intervals in my logs. 16:07:04,603 to 16:07:10,391 16:09:07,377 to 16:09:13,076 16:11:10,005 to 16:11:16,770 16:13:13,291 to 16:13:19,296 16:15:16,000 to 16:15:21,602 >From what I can tell, the delay is on the coaster service side: the C client is just waiting for a response. The jobs are just being submitted through the local job manager, so I wouldn't expect any delays there. The tasks are also just "/bin/hostname", so should return immediately. I'm going to continue digging into this on my own, but the 2 minute delay seems like a big clue: does anyone have an idea what could cause stalls in task submission of 2 minute duration? Cheers, Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Sep 3 18:20:46 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 3 Sep 2014 16:20:46 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: Message-ID: <1409786446.18898.0.camel@echo> Hi Tim, I've never seen this before with pure Java. Do you have logs from these runs? Mihael On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > I'm running a test Swift/T script that submit tasks to Coasters through the > C++ client and I'm seeing some odd behaviour where task > submission/execution is stalling for ~2 minute periods. For example, I'm > seeing submit log messages like "submitting > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of several > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing bursts > with the following intervals in my logs. > > 16:07:04,603 to 16:07:10,391 > 16:09:07,377 to 16:09:13,076 > 16:11:10,005 to 16:11:16,770 > 16:13:13,291 to 16:13:19,296 > 16:15:16,000 to 16:15:21,602 > > From what I can tell, the delay is on the coaster service side: the C > client is just waiting for a response. > > The jobs are just being submitted through the local job manager, so I > wouldn't expect any delays there. The tasks are also just "/bin/hostname", > so should return immediately. > > I'm going to continue digging into this on my own, but the 2 minute delay > seems like a big clue: does anyone have an idea what could cause stalls in > task submission of 2 minute duration? > > Cheers, > Tim From tim.g.armstrong at gmail.com Wed Sep 3 20:26:33 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Wed, 3 Sep 2014 20:26:33 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1409786446.18898.0.camel@echo> References: <1409786446.18898.0.camel@echo> Message-ID: Here are client and service logs, with part of service log edited down to be a reasonable size (I have the full thing if needed, but it was over a gigabyte). One relevant section is from 19:49:35 onwards. The client submits 4 jobs (its limit), but they don't complete until 19:51:32 or so (I can see that one task completed based on ncompleted=1 in the check_tasks log message). It looks like something has happened with broken pipes and workers being lost, but I'm not sure what the ultimate cause of that is likely to be. - Tim On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan wrote: > Hi Tim, > > I've never seen this before with pure Java. > > Do you have logs from these runs? 
> > Mihael > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > I'm running a test Swift/T script that submit tasks to Coasters through > the > > C++ client and I'm seeing some odd behaviour where task > > submission/execution is stalling for ~2 minute periods. For example, I'm > > seeing submit log messages like "submitting > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of several > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing > bursts > > with the following intervals in my logs. > > > > 16:07:04,603 to 16:07:10,391 > > 16:09:07,377 to 16:09:13,076 > > 16:11:10,005 to 16:11:16,770 > > 16:13:13,291 to 16:13:19,296 > > 16:15:16,000 to 16:15:21,602 > > > > From what I can tell, the delay is on the coaster service side: the C > > client is just waiting for a response. > > > > The jobs are just being submitted through the local job manager, so I > > wouldn't expect any delays there. The tasks are also just > "/bin/hostname", > > so should return immediately. > > > > I'm going to continue digging into this on my own, but the 2 minute delay > > seems like a big clue: does anyone have an idea what could cause stalls > in > > task submission of 2 minute duration? > > > > Cheers, > > Tim > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: coaster-service.out.gz Type: application/x-gzip Size: 36069 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: swift-t-client.out.gz Type: application/x-gzip Size: 1049192 bytes Desc: not available URL: From hategan at mcs.anl.gov Wed Sep 3 22:35:22 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 3 Sep 2014 20:35:22 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> Message-ID: <1409801722.21132.8.camel@echo> Ah, makes sense. 2 minutes is the channel timeout. Each live connection is guaranteed to have some communication for any 2 minute time window, partially due to periodic heartbeats (sent every 1 minute). If no packets flow for the duration of 2 minutes, the connection is assumed broken and all jobs that were submitted to the respective workers are considered failed. So there seems to be an issue with the connections to some of the workers, and it takes 2 minutes to detect them. Since the service seems to be alive (although a jstack on the service when thing seem to hang might help), this leaves two possibilities: 1 - some genuine network problem 2 - the worker died without properly closing TCP connections If (2), you could enable worker logging (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything shows up. Mihael On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > Here are client and service logs, with part of service log edited down to > be a reasonable size (I have the full thing if needed, but it was over a > gigabyte). > > One relevant section is from 19:49:35 onwards. The client submits 4 jobs > (its limit), but they don't complete until 19:51:32 or so (I can see that > one task completed based on ncompleted=1 in the check_tasks log message). > It looks like something has happened with broken pipes and workers being > lost, but I'm not sure what the ultimate cause of that is likely to be. 
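
The two-minute gaps reported at the start of this thread line up exactly with the channel-timeout rule described above: heartbeats are sent every minute, and a connection with no traffic for a full two-minute window is assumed broken, failing the jobs submitted through it. The following is a minimal, purely illustrative sketch of that liveness rule; the class and method names are invented here and are not taken from the Coaster source:

---------------
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative watchdog for the rule described above: any packet
 * (including the once-a-minute heartbeat) refreshes a channel's
 * timestamp, and a channel with no traffic for 2 minutes is treated
 * as dead so its jobs can be failed. Not the actual implementation.
 */
class ChannelWatchdog {
    private static final long TIMEOUT_MS = 2 * 60 * 1000;

    // channel id -> time (ms) when the last packet was seen
    private final Map<String, Long> lastSeen =
        new ConcurrentHashMap<String, Long>();

    /** Call whenever any packet, heartbeat or otherwise, arrives. */
    public void packetReceived(String channelId) {
        lastSeen.put(channelId, Long.valueOf(System.currentTimeMillis()));
    }

    /** Polled periodically; true means the channel should be dropped. */
    public boolean isDead(String channelId) {
        Long last = lastSeen.get(channelId);
        return last != null
            && System.currentTimeMillis() - last.longValue() > TIMEOUT_MS;
    }
}
---------------
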
> > - Tim > > > > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan wrote: > > > Hi Tim, > > > > I've never seen this before with pure Java. > > > > Do you have logs from these runs? > > > > Mihael > > > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > > I'm running a test Swift/T script that submit tasks to Coasters through > > the > > > C++ client and I'm seeing some odd behaviour where task > > > submission/execution is stalling for ~2 minute periods. For example, I'm > > > seeing submit log messages like "submitting > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of several > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing > > bursts > > > with the following intervals in my logs. > > > > > > 16:07:04,603 to 16:07:10,391 > > > 16:09:07,377 to 16:09:13,076 > > > 16:11:10,005 to 16:11:16,770 > > > 16:13:13,291 to 16:13:19,296 > > > 16:15:16,000 to 16:15:21,602 > > > > > > From what I can tell, the delay is on the coaster service side: the C > > > client is just waiting for a response. > > > > > > The jobs are just being submitted through the local job manager, so I > > > wouldn't expect any delays there. The tasks are also just > > "/bin/hostname", > > > so should return immediately. > > > > > > I'm going to continue digging into this on my own, but the 2 minute delay > > > seems like a big clue: does anyone have an idea what could cause stalls > > in > > > task submission of 2 minute duration? > > > > > > Cheers, > > > Tim > > > > > > From tim.g.armstrong at gmail.com Thu Sep 4 13:11:04 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 4 Sep 2014 13:11:04 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1409801722.21132.8.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> Message-ID: This is all running locally on my laptop, so I think we can rule out 1). It also seems like it's a state the coaster service gets into after a few client sessions: generally the first coaster run works fine, then after a few runs the problem occurs more frequently. I'm going to try and get worker logs, in the meantime i've got some jstacks (attached). Matching service logs (largish) are here if needed: http://people.cs.uchicago.edu/~tga/service.out.gz On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan wrote: > Ah, makes sense. > > 2 minutes is the channel timeout. Each live connection is guaranteed to > have some communication for any 2 minute time window, partially due to > periodic heartbeats (sent every 1 minute). If no packets flow for the > duration of 2 minutes, the connection is assumed broken and all jobs > that were submitted to the respective workers are considered failed. So > there seems to be an issue with the connections to some of the workers, > and it takes 2 minutes to detect them. > > Since the service seems to be alive (although a jstack on the service > when thing seem to hang might help), this leaves two possibilities: > 1 - some genuine network problem > 2 - the worker died without properly closing TCP connections > > If (2), you could enable worker logging > (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything shows > up. > > Mihael > > On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > Here are client and service logs, with part of service log edited down to > > be a reasonable size (I have the full thing if needed, but it was over a > > gigabyte). > > > > One relevant section is from 19:49:35 onwards. 
The client submits 4 jobs > > (its limit), but they don't complete until 19:51:32 or so (I can see that > > one task completed based on ncompleted=1 in the check_tasks log message). > > It looks like something has happened with broken pipes and workers being > > lost, but I'm not sure what the ultimate cause of that is likely to be. > > > > - Tim > > > > > > > > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan > wrote: > > > > > Hi Tim, > > > > > > I've never seen this before with pure Java. > > > > > > Do you have logs from these runs? > > > > > > Mihael > > > > > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > > > I'm running a test Swift/T script that submit tasks to Coasters > through > > > the > > > > C++ client and I'm seeing some odd behaviour where task > > > > submission/execution is stalling for ~2 minute periods. For > example, I'm > > > > seeing submit log messages like "submitting > > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of > several > > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing > > > bursts > > > > with the following intervals in my logs. > > > > > > > > 16:07:04,603 to 16:07:10,391 > > > > 16:09:07,377 to 16:09:13,076 > > > > 16:11:10,005 to 16:11:16,770 > > > > 16:13:13,291 to 16:13:19,296 > > > > 16:15:16,000 to 16:15:21,602 > > > > > > > > From what I can tell, the delay is on the coaster service side: the C > > > > client is just waiting for a response. > > > > > > > > The jobs are just being submitted through the local job manager, so I > > > > wouldn't expect any delays there. The tasks are also just > > > "/bin/hostname", > > > > so should return immediately. > > > > > > > > I'm going to continue digging into this on my own, but the 2 minute > delay > > > > seems like a big clue: does anyone have an idea what could cause > stalls > > > in > > > > task submission of 2 minute duration? > > > > > > > > Cheers, > > > > Tim > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: hostnames-run1.out Type: application/octet-stream Size: 310493 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: hostnames-run2.out Type: application/octet-stream Size: 4461088 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: jstack.out Type: application/octet-stream Size: 113681 bytes Desc: not available URL: From tim.g.armstrong at gmail.com Thu Sep 4 14:35:29 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 4 Sep 2014 14:35:29 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> Message-ID: Ok, now I have some worker logs: http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz There's nothing obvious I see in the worker logs that would indicate why the connection was broken. - Tim On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong wrote: > This is all running locally on my laptop, so I think we can rule out 1). > > It also seems like it's a state the coaster service gets into after a few > client sessions: generally the first coaster run works fine, then after a > few runs the problem occurs more frequently. > > I'm going to try and get worker logs, in the meantime i've got some > jstacks (attached). 
> > Matching service logs (largish) are here if needed: > http://people.cs.uchicago.edu/~tga/service.out.gz > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan > wrote: > >> Ah, makes sense. >> >> 2 minutes is the channel timeout. Each live connection is guaranteed to >> have some communication for any 2 minute time window, partially due to >> periodic heartbeats (sent every 1 minute). If no packets flow for the >> duration of 2 minutes, the connection is assumed broken and all jobs >> that were submitted to the respective workers are considered failed. So >> there seems to be an issue with the connections to some of the workers, >> and it takes 2 minutes to detect them. >> >> Since the service seems to be alive (although a jstack on the service >> when thing seem to hang might help), this leaves two possibilities: >> 1 - some genuine network problem >> 2 - the worker died without properly closing TCP connections >> >> If (2), you could enable worker logging >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything shows >> up. >> >> Mihael >> >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: >> > Here are client and service logs, with part of service log edited down >> to >> > be a reasonable size (I have the full thing if needed, but it was over a >> > gigabyte). >> > >> > One relevant section is from 19:49:35 onwards. The client submits 4 >> jobs >> > (its limit), but they don't complete until 19:51:32 or so (I can see >> that >> > one task completed based on ncompleted=1 in the check_tasks log >> message). >> > It looks like something has happened with broken pipes and workers being >> > lost, but I'm not sure what the ultimate cause of that is likely to be. >> > >> > - Tim >> > >> > >> > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan >> wrote: >> > >> > > Hi Tim, >> > > >> > > I've never seen this before with pure Java. >> > > >> > > Do you have logs from these runs? >> > > >> > > Mihael >> > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: >> > > > I'm running a test Swift/T script that submit tasks to Coasters >> through >> > > the >> > > > C++ client and I'm seeing some odd behaviour where task >> > > > submission/execution is stalling for ~2 minute periods. For >> example, I'm >> > > > seeing submit log messages like "submitting >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of >> several >> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing >> > > bursts >> > > > with the following intervals in my logs. >> > > > >> > > > 16:07:04,603 to 16:07:10,391 >> > > > 16:09:07,377 to 16:09:13,076 >> > > > 16:11:10,005 to 16:11:16,770 >> > > > 16:13:13,291 to 16:13:19,296 >> > > > 16:15:16,000 to 16:15:21,602 >> > > > >> > > > From what I can tell, the delay is on the coaster service side: the >> C >> > > > client is just waiting for a response. >> > > > >> > > > The jobs are just being submitted through the local job manager, so >> I >> > > > wouldn't expect any delays there. The tasks are also just >> > > "/bin/hostname", >> > > > so should return immediately. >> > > > >> > > > I'm going to continue digging into this on my own, but the 2 minute >> delay >> > > > seems like a big clue: does anyone have an idea what could cause >> stalls >> > > in >> > > > task submission of 2 minute duration? >> > > > >> > > > Cheers, >> > > > Tim >> > > >> > > >> > > >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Thu Sep 4 15:03:06 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 4 Sep 2014 13:03:06 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> Message-ID: <1409860986.7960.3.camel@echo> The first worker "failing" is 0904-20022331. The log looks funny at the end. Can you git pull and re-run? The worker is getting some command at the end there and doing nothing about it and I wonder why. Mihael On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > Ok, now I have some worker logs: > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > There's nothing obvious I see in the worker logs that would indicate why > the connection was broken. > > - Tim > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong > wrote: > > > This is all running locally on my laptop, so I think we can rule out 1). > > > > It also seems like it's a state the coaster service gets into after a few > > client sessions: generally the first coaster run works fine, then after a > > few runs the problem occurs more frequently. > > > > I'm going to try and get worker logs, in the meantime i've got some > > jstacks (attached). > > > > Matching service logs (largish) are here if needed: > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan > > wrote: > > > >> Ah, makes sense. > >> > >> 2 minutes is the channel timeout. Each live connection is guaranteed to > >> have some communication for any 2 minute time window, partially due to > >> periodic heartbeats (sent every 1 minute). If no packets flow for the > >> duration of 2 minutes, the connection is assumed broken and all jobs > >> that were submitted to the respective workers are considered failed. So > >> there seems to be an issue with the connections to some of the workers, > >> and it takes 2 minutes to detect them. > >> > >> Since the service seems to be alive (although a jstack on the service > >> when thing seem to hang might help), this leaves two possibilities: > >> 1 - some genuine network problem > >> 2 - the worker died without properly closing TCP connections > >> > >> If (2), you could enable worker logging > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything shows > >> up. > >> > >> Mihael > >> > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > >> > Here are client and service logs, with part of service log edited down > >> to > >> > be a reasonable size (I have the full thing if needed, but it was over a > >> > gigabyte). > >> > > >> > One relevant section is from 19:49:35 onwards. The client submits 4 > >> jobs > >> > (its limit), but they don't complete until 19:51:32 or so (I can see > >> that > >> > one task completed based on ncompleted=1 in the check_tasks log > >> message). > >> > It looks like something has happened with broken pipes and workers being > >> > lost, but I'm not sure what the ultimate cause of that is likely to be. > >> > > >> > - Tim > >> > > >> > > >> > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan > >> wrote: > >> > > >> > > Hi Tim, > >> > > > >> > > I've never seen this before with pure Java. > >> > > > >> > > Do you have logs from these runs? 
> >> > > > >> > > Mihael > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > >> > > > I'm running a test Swift/T script that submit tasks to Coasters > >> through > >> > > the > >> > > > C++ client and I'm seeing some odd behaviour where task > >> > > > submission/execution is stalling for ~2 minute periods. For > >> example, I'm > >> > > > seeing submit log messages like "submitting > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of > >> several > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing > >> > > bursts > >> > > > with the following intervals in my logs. > >> > > > > >> > > > 16:07:04,603 to 16:07:10,391 > >> > > > 16:09:07,377 to 16:09:13,076 > >> > > > 16:11:10,005 to 16:11:16,770 > >> > > > 16:13:13,291 to 16:13:19,296 > >> > > > 16:15:16,000 to 16:15:21,602 > >> > > > > >> > > > From what I can tell, the delay is on the coaster service side: the > >> C > >> > > > client is just waiting for a response. > >> > > > > >> > > > The jobs are just being submitted through the local job manager, so > >> I > >> > > > wouldn't expect any delays there. The tasks are also just > >> > > "/bin/hostname", > >> > > > so should return immediately. > >> > > > > >> > > > I'm going to continue digging into this on my own, but the 2 minute > >> delay > >> > > > seems like a big clue: does anyone have an idea what could cause > >> stalls > >> > > in > >> > > > task submission of 2 minute duration? > >> > > > > >> > > > Cheers, > >> > > > Tim > >> > > > >> > > > >> > > > >> > >> > >> > > From tim.g.armstrong at gmail.com Thu Sep 4 15:34:17 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 4 Sep 2014 15:34:17 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1409860986.7960.3.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> Message-ID: Should be here: http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan wrote: > The first worker "failing" is 0904-20022331. The log looks funny at the > end. > > Can you git pull and re-run? The worker is getting some command at the > end there and doing nothing about it and I wonder why. > > Mihael > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > Ok, now I have some worker logs: > > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > There's nothing obvious I see in the worker logs that would indicate why > > the connection was broken. > > > > - Tim > > > > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong > > > wrote: > > > > > This is all running locally on my laptop, so I think we can rule out > 1). > > > > > > It also seems like it's a state the coaster service gets into after a > few > > > client sessions: generally the first coaster run works fine, then > after a > > > few runs the problem occurs more frequently. > > > > > > I'm going to try and get worker logs, in the meantime i've got some > > > jstacks (attached). > > > > > > Matching service logs (largish) are here if needed: > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > > > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan > > > wrote: > > > > > >> Ah, makes sense. > > >> > > >> 2 minutes is the channel timeout. Each live connection is guaranteed > to > > >> have some communication for any 2 minute time window, partially due to > > >> periodic heartbeats (sent every 1 minute). 
If no packets flow for the > > >> duration of 2 minutes, the connection is assumed broken and all jobs > > >> that were submitted to the respective workers are considered failed. > So > > >> there seems to be an issue with the connections to some of the > workers, > > >> and it takes 2 minutes to detect them. > > >> > > >> Since the service seems to be alive (although a jstack on the service > > >> when thing seem to hang might help), this leaves two possibilities: > > >> 1 - some genuine network problem > > >> 2 - the worker died without properly closing TCP connections > > >> > > >> If (2), you could enable worker logging > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything > shows > > >> up. > > >> > > >> Mihael > > >> > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > >> > Here are client and service logs, with part of service log edited > down > > >> to > > >> > be a reasonable size (I have the full thing if needed, but it was > over a > > >> > gigabyte). > > >> > > > >> > One relevant section is from 19:49:35 onwards. The client submits 4 > > >> jobs > > >> > (its limit), but they don't complete until 19:51:32 or so (I can see > > >> that > > >> > one task completed based on ncompleted=1 in the check_tasks log > > >> message). > > >> > It looks like something has happened with broken pipes and workers > being > > >> > lost, but I'm not sure what the ultimate cause of that is likely to > be. > > >> > > > >> > - Tim > > >> > > > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan > > > >> wrote: > > >> > > > >> > > Hi Tim, > > >> > > > > >> > > I've never seen this before with pure Java. > > >> > > > > >> > > Do you have logs from these runs? > > >> > > > > >> > > Mihael > > >> > > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > >> > > > I'm running a test Swift/T script that submit tasks to Coasters > > >> through > > >> > > the > > >> > > > C++ client and I'm seeing some odd behaviour where task > > >> > > > submission/execution is stalling for ~2 minute periods. For > > >> example, I'm > > >> > > > seeing submit log messages like "submitting > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of > > >> several > > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm > seeing > > >> > > bursts > > >> > > > with the following intervals in my logs. > > >> > > > > > >> > > > 16:07:04,603 to 16:07:10,391 > > >> > > > 16:09:07,377 to 16:09:13,076 > > >> > > > 16:11:10,005 to 16:11:16,770 > > >> > > > 16:13:13,291 to 16:13:19,296 > > >> > > > 16:15:16,000 to 16:15:21,602 > > >> > > > > > >> > > > From what I can tell, the delay is on the coaster service side: > the > > >> C > > >> > > > client is just waiting for a response. > > >> > > > > > >> > > > The jobs are just being submitted through the local job > manager, so > > >> I > > >> > > > wouldn't expect any delays there. The tasks are also just > > >> > > "/bin/hostname", > > >> > > > so should return immediately. > > >> > > > > > >> > > > I'm going to continue digging into this on my own, but the 2 > minute > > >> delay > > >> > > > seems like a big clue: does anyone have an idea what could cause > > >> stalls > > >> > > in > > >> > > > task submission of 2 minute duration? > > >> > > > > > >> > > > Cheers, > > >> > > > Tim > > >> > > > > >> > > > > >> > > > > >> > > >> > > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Thu Sep 4 19:27:18 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 4 Sep 2014 17:27:18 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> Message-ID: <1409876838.3600.8.camel@echo> Ok, so that's legit. It does look like shut down workers are not properly accounted for in some places (and I believe Yadu submitted a bug for this). However, I do not see the dead time you mention in either of the last two sets of logs. It looks like each client instance submits a continous stream of jobs. So let's get back to the initial log. Can I have the full service log? I'm trying to track what happened with the jobs submitted before the first big pause. Also, a log message in CoasterClient::updateJobStatus() (or friends) would probably help a lot here. Mihael On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > Should be here: > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > > > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan wrote: > > > The first worker "failing" is 0904-20022331. The log looks funny at the > > end. > > > > Can you git pull and re-run? The worker is getting some command at the > > end there and doing nothing about it and I wonder why. > > > > Mihael > > > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > > Ok, now I have some worker logs: > > > > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > > > There's nothing obvious I see in the worker logs that would indicate why > > > the connection was broken. > > > > > > - Tim > > > > > > > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong > > > > > wrote: > > > > > > > This is all running locally on my laptop, so I think we can rule out > > 1). > > > > > > > > It also seems like it's a state the coaster service gets into after a > > few > > > > client sessions: generally the first coaster run works fine, then > > after a > > > > few runs the problem occurs more frequently. > > > > > > > > I'm going to try and get worker logs, in the meantime i've got some > > > > jstacks (attached). > > > > > > > > Matching service logs (largish) are here if needed: > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > > > > > > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan > > > > wrote: > > > > > > > >> Ah, makes sense. > > > >> > > > >> 2 minutes is the channel timeout. Each live connection is guaranteed > > to > > > >> have some communication for any 2 minute time window, partially due to > > > >> periodic heartbeats (sent every 1 minute). If no packets flow for the > > > >> duration of 2 minutes, the connection is assumed broken and all jobs > > > >> that were submitted to the respective workers are considered failed. > > So > > > >> there seems to be an issue with the connections to some of the > > workers, > > > >> and it takes 2 minutes to detect them. > > > >> > > > >> Since the service seems to be alive (although a jstack on the service > > > >> when thing seem to hang might help), this leaves two possibilities: > > > >> 1 - some genuine network problem > > > >> 2 - the worker died without properly closing TCP connections > > > >> > > > >> If (2), you could enable worker logging > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything > > shows > > > >> up. 
> > > >> > > > >> Mihael > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > > >> > Here are client and service logs, with part of service log edited > > down > > > >> to > > > >> > be a reasonable size (I have the full thing if needed, but it was > > over a > > > >> > gigabyte). > > > >> > > > > >> > One relevant section is from 19:49:35 onwards. The client submits 4 > > > >> jobs > > > >> > (its limit), but they don't complete until 19:51:32 or so (I can see > > > >> that > > > >> > one task completed based on ncompleted=1 in the check_tasks log > > > >> message). > > > >> > It looks like something has happened with broken pipes and workers > > being > > > >> > lost, but I'm not sure what the ultimate cause of that is likely to > > be. > > > >> > > > > >> > - Tim > > > >> > > > > >> > > > > >> > > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan > > > > > >> wrote: > > > >> > > > > >> > > Hi Tim, > > > >> > > > > > >> > > I've never seen this before with pure Java. > > > >> > > > > > >> > > Do you have logs from these runs? > > > >> > > > > > >> > > Mihael > > > >> > > > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > > >> > > > I'm running a test Swift/T script that submit tasks to Coasters > > > >> through > > > >> > > the > > > >> > > > C++ client and I'm seeing some odd behaviour where task > > > >> > > > submission/execution is stalling for ~2 minute periods. For > > > >> example, I'm > > > >> > > > seeing submit log messages like "submitting > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of > > > >> several > > > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm > > seeing > > > >> > > bursts > > > >> > > > with the following intervals in my logs. > > > >> > > > > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > >> > > > > > > >> > > > From what I can tell, the delay is on the coaster service side: > > the > > > >> C > > > >> > > > client is just waiting for a response. > > > >> > > > > > > >> > > > The jobs are just being submitted through the local job > > manager, so > > > >> I > > > >> > > > wouldn't expect any delays there. The tasks are also just > > > >> > > "/bin/hostname", > > > >> > > > so should return immediately. > > > >> > > > > > > >> > > > I'm going to continue digging into this on my own, but the 2 > > minute > > > >> delay > > > >> > > > seems like a big clue: does anyone have an idea what could cause > > > >> stalls > > > >> > > in > > > >> > > > task submission of 2 minute duration? > > > >> > > > > > > >> > > > Cheers, > > > >> > > > Tim > > > >> > > > > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > > > > > > > > From tim.g.armstrong at gmail.com Fri Sep 5 08:55:04 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Fri, 5 Sep 2014 08:55:04 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1409876838.3600.8.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> Message-ID: It's here: http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . I'll add some extra debug messages in the coaster C++ client and see if I can recreate the scenario. - Tim On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan wrote: > Ok, so that's legit. 
> > It does look like shut down workers are not properly accounted for in > some places (and I believe Yadu submitted a bug for this). However, I do > not see the dead time you mention in either of the last two sets of > logs. It looks like each client instance submits a continous stream of > jobs. > > So let's get back to the initial log. Can I have the full service log? > I'm trying to track what happened with the jobs submitted before the > first big pause. > > Also, a log message in CoasterClient::updateJobStatus() (or friends) > would probably help a lot here. > > Mihael > > On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > Should be here: > > > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > > > > > > > > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan > wrote: > > > > > The first worker "failing" is 0904-20022331. The log looks funny at the > > > end. > > > > > > Can you git pull and re-run? The worker is getting some command at the > > > end there and doing nothing about it and I wonder why. > > > > > > Mihael > > > > > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > > > Ok, now I have some worker logs: > > > > > > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > > > > > There's nothing obvious I see in the worker logs that would indicate > why > > > > the connection was broken. > > > > > > > > - Tim > > > > > > > > > > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > tim.g.armstrong at gmail.com > > > > > > > > wrote: > > > > > > > > > This is all running locally on my laptop, so I think we can rule > out > > > 1). > > > > > > > > > > It also seems like it's a state the coaster service gets into > after a > > > few > > > > > client sessions: generally the first coaster run works fine, then > > > after a > > > > > few runs the problem occurs more frequently. > > > > > > > > > > I'm going to try and get worker logs, in the meantime i've got some > > > > > jstacks (attached). > > > > > > > > > > Matching service logs (largish) are here if needed: > > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > > > > > > > > > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > hategan at mcs.anl.gov> > > > > > wrote: > > > > > > > > > >> Ah, makes sense. > > > > >> > > > > >> 2 minutes is the channel timeout. Each live connection is > guaranteed > > > to > > > > >> have some communication for any 2 minute time window, partially > due to > > > > >> periodic heartbeats (sent every 1 minute). If no packets flow for > the > > > > >> duration of 2 minutes, the connection is assumed broken and all > jobs > > > > >> that were submitted to the respective workers are considered > failed. > > > So > > > > >> there seems to be an issue with the connections to some of the > > > workers, > > > > >> and it takes 2 minutes to detect them. > > > > >> > > > > >> Since the service seems to be alive (although a jstack on the > service > > > > >> when thing seem to hang might help), this leaves two > possibilities: > > > > >> 1 - some genuine network problem > > > > >> 2 - the worker died without properly closing TCP connections > > > > >> > > > > >> If (2), you could enable worker logging > > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything > > > shows > > > > >> up. 
> > > > >> > > > > >> Mihael > > > > >> > > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > > > >> > Here are client and service logs, with part of service log > edited > > > down > > > > >> to > > > > >> > be a reasonable size (I have the full thing if needed, but it > was > > > over a > > > > >> > gigabyte). > > > > >> > > > > > >> > One relevant section is from 19:49:35 onwards. The client > submits 4 > > > > >> jobs > > > > >> > (its limit), but they don't complete until 19:51:32 or so (I > can see > > > > >> that > > > > >> > one task completed based on ncompleted=1 in the check_tasks log > > > > >> message). > > > > >> > It looks like something has happened with broken pipes and > workers > > > being > > > > >> > lost, but I'm not sure what the ultimate cause of that is > likely to > > > be. > > > > >> > > > > > >> > - Tim > > > > >> > > > > > >> > > > > > >> > > > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > hategan at mcs.anl.gov > > > > > > > > >> wrote: > > > > >> > > > > > >> > > Hi Tim, > > > > >> > > > > > > >> > > I've never seen this before with pure Java. > > > > >> > > > > > > >> > > Do you have logs from these runs? > > > > >> > > > > > > >> > > Mihael > > > > >> > > > > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > > > >> > > > I'm running a test Swift/T script that submit tasks to > Coasters > > > > >> through > > > > >> > > the > > > > >> > > > C++ client and I'm seeing some odd behaviour where task > > > > >> > > > submission/execution is stalling for ~2 minute periods. For > > > > >> example, I'm > > > > >> > > > seeing submit log messages like "submitting > > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in > bursts of > > > > >> several > > > > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm > > > seeing > > > > >> > > bursts > > > > >> > > > with the following intervals in my logs. > > > > >> > > > > > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > > >> > > > > > > > >> > > > From what I can tell, the delay is on the coaster service > side: > > > the > > > > >> C > > > > >> > > > client is just waiting for a response. > > > > >> > > > > > > > >> > > > The jobs are just being submitted through the local job > > > manager, so > > > > >> I > > > > >> > > > wouldn't expect any delays there. The tasks are also just > > > > >> > > "/bin/hostname", > > > > >> > > > so should return immediately. > > > > >> > > > > > > > >> > > > I'm going to continue digging into this on my own, but the 2 > > > minute > > > > >> delay > > > > >> > > > seems like a big clue: does anyone have an idea what could > cause > > > > >> stalls > > > > >> > > in > > > > >> > > > task submission of 2 minute duration? > > > > >> > > > > > > > >> > > > Cheers, > > > > >> > > > Tim > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > >> > > > > >> > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tim.g.armstrong at gmail.com Fri Sep 5 12:13:02 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Fri, 5 Sep 2014 12:13:02 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> Message-ID: Ok, here it is with the additional debug messages. Source code change is in commit 890c41f2ba701b10264553471590096d6f94c278. Warning: the tarball will expand to several gigabytes of logs. I had to do multiple client runs to trigger it. It seems like the problem might be triggered by abnormal termination of the client. First 18 runs went fine, problem only started when I ctrl-c-ed the swift/t run #19 before the run #20 that exhibited delays. http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz - Tim On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong wrote: > It's here: > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . > > I'll add some extra debug messages in the coaster C++ client and see if I > can recreate the scenario. > > - Tim > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan > wrote: > >> Ok, so that's legit. >> >> It does look like shut down workers are not properly accounted for in >> some places (and I believe Yadu submitted a bug for this). However, I do >> not see the dead time you mention in either of the last two sets of >> logs. It looks like each client instance submits a continous stream of >> jobs. >> >> So let's get back to the initial log. Can I have the full service log? >> I'm trying to track what happened with the jobs submitted before the >> first big pause. >> >> Also, a log message in CoasterClient::updateJobStatus() (or friends) >> would probably help a lot here. >> >> Mihael >> >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: >> > Should be here: >> > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz >> > >> > >> > >> > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan >> wrote: >> > >> > > The first worker "failing" is 0904-20022331. The log looks funny at >> the >> > > end. >> > > >> > > Can you git pull and re-run? The worker is getting some command at the >> > > end there and doing nothing about it and I wonder why. >> > > >> > > Mihael >> > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: >> > > > Ok, now I have some worker logs: >> > > > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz >> > > > >> > > > There's nothing obvious I see in the worker logs that would >> indicate why >> > > > the connection was broken. >> > > > >> > > > - Tim >> > > > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < >> tim.g.armstrong at gmail.com >> > > > >> > > > wrote: >> > > > >> > > > > This is all running locally on my laptop, so I think we can rule >> out >> > > 1). >> > > > > >> > > > > It also seems like it's a state the coaster service gets into >> after a >> > > few >> > > > > client sessions: generally the first coaster run works fine, then >> > > after a >> > > > > few runs the problem occurs more frequently. >> > > > > >> > > > > I'm going to try and get worker logs, in the meantime i've got >> some >> > > > > jstacks (attached). 
>> > > > > >> > > > > Matching service logs (largish) are here if needed: >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz >> > > > > >> > > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < >> hategan at mcs.anl.gov> >> > > > > wrote: >> > > > > >> > > > >> Ah, makes sense. >> > > > >> >> > > > >> 2 minutes is the channel timeout. Each live connection is >> guaranteed >> > > to >> > > > >> have some communication for any 2 minute time window, partially >> due to >> > > > >> periodic heartbeats (sent every 1 minute). If no packets flow >> for the >> > > > >> duration of 2 minutes, the connection is assumed broken and all >> jobs >> > > > >> that were submitted to the respective workers are considered >> failed. >> > > So >> > > > >> there seems to be an issue with the connections to some of the >> > > workers, >> > > > >> and it takes 2 minutes to detect them. >> > > > >> >> > > > >> Since the service seems to be alive (although a jstack on the >> service >> > > > >> when thing seem to hang might help), this leaves two >> possibilities: >> > > > >> 1 - some genuine network problem >> > > > >> 2 - the worker died without properly closing TCP connections >> > > > >> >> > > > >> If (2), you could enable worker logging >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if >> anything >> > > shows >> > > > >> up. >> > > > >> >> > > > >> Mihael >> > > > >> >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: >> > > > >> > Here are client and service logs, with part of service log >> edited >> > > down >> > > > >> to >> > > > >> > be a reasonable size (I have the full thing if needed, but it >> was >> > > over a >> > > > >> > gigabyte). >> > > > >> > >> > > > >> > One relevant section is from 19:49:35 onwards. The client >> submits 4 >> > > > >> jobs >> > > > >> > (its limit), but they don't complete until 19:51:32 or so (I >> can see >> > > > >> that >> > > > >> > one task completed based on ncompleted=1 in the check_tasks log >> > > > >> message). >> > > > >> > It looks like something has happened with broken pipes and >> workers >> > > being >> > > > >> > lost, but I'm not sure what the ultimate cause of that is >> likely to >> > > be. >> > > > >> > >> > > > >> > - Tim >> > > > >> > >> > > > >> > >> > > > >> > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < >> hategan at mcs.anl.gov >> > > > >> > > > >> wrote: >> > > > >> > >> > > > >> > > Hi Tim, >> > > > >> > > >> > > > >> > > I've never seen this before with pure Java. >> > > > >> > > >> > > > >> > > Do you have logs from these runs? >> > > > >> > > >> > > > >> > > Mihael >> > > > >> > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: >> > > > >> > > > I'm running a test Swift/T script that submit tasks to >> Coasters >> > > > >> through >> > > > >> > > the >> > > > >> > > > C++ client and I'm seeing some odd behaviour where task >> > > > >> > > > submission/execution is stalling for ~2 minute periods. >> For >> > > > >> example, I'm >> > > > >> > > > seeing submit log messages like "submitting >> > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in >> bursts of >> > > > >> several >> > > > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. >> I'm >> > > seeing >> > > > >> > > bursts >> > > > >> > > > with the following intervals in my logs. 
>> > > > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 >> > > > >> > > > 16:09:07,377 to 16:09:13,076 >> > > > >> > > > 16:11:10,005 to 16:11:16,770 >> > > > >> > > > 16:13:13,291 to 16:13:19,296 >> > > > >> > > > 16:15:16,000 to 16:15:21,602 >> > > > >> > > > >> > > > >> > > > From what I can tell, the delay is on the coaster service >> side: >> > > the >> > > > >> C >> > > > >> > > > client is just waiting for a response. >> > > > >> > > > >> > > > >> > > > The jobs are just being submitted through the local job >> > > manager, so >> > > > >> I >> > > > >> > > > wouldn't expect any delays there. The tasks are also just >> > > > >> > > "/bin/hostname", >> > > > >> > > > so should return immediately. >> > > > >> > > > >> > > > >> > > > I'm going to continue digging into this on my own, but the >> 2 >> > > minute >> > > > >> delay >> > > > >> > > > seems like a big clue: does anyone have an idea what could >> cause >> > > > >> stalls >> > > > >> > > in >> > > > >> > > > task submission of 2 minute duration? >> > > > >> > > > >> > > > >> > > > Cheers, >> > > > >> > > > Tim >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> >> > > > >> >> > > > >> >> > > > > >> > > >> > > >> > > >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Fri Sep 5 12:57:18 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 5 Sep 2014 10:57:18 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> Message-ID: <1409939838.12288.3.camel@echo> Thanks. It also seems that there is an older bug in there in which the client connection is not properly accounted for and things start failing two minutes after the client connects (which is also probably why you didn't see this in runs with many short client connections). I'm not sure why the fix for that bug isn't in the trunk code. In any event, I'll set up a client submission loop and fix all these things. Mihael On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > Ok, here it is with the additional debug messages. Source code change is > in commit 890c41f2ba701b10264553471590096d6f94c278. > > Warning: the tarball will expand to several gigabytes of logs. > > I had to do multiple client runs to trigger it. It seems like the problem > might be triggered by abnormal termination of the client. First 18 runs > went fine, problem only started when I ctrl-c-ed the swift/t run #19 before > the run #20 that exhibited delays. > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > - Tim > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong > wrote: > > > It's here: > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . > > > > I'll add some extra debug messages in the coaster C++ client and see if I > > can recreate the scenario. > > > > - Tim > > > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan > > wrote: > > > >> Ok, so that's legit. > >> > >> It does look like shut down workers are not properly accounted for in > >> some places (and I believe Yadu submitted a bug for this). However, I do > >> not see the dead time you mention in either of the last two sets of > >> logs. It looks like each client instance submits a continous stream of > >> jobs. > >> > >> So let's get back to the initial log. Can I have the full service log? 
> >> I'm trying to track what happened with the jobs submitted before the > >> first big pause. > >> > >> Also, a log message in CoasterClient::updateJobStatus() (or friends) > >> would probably help a lot here. > >> > >> Mihael > >> > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > >> > Should be here: > >> > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > >> > > >> > > >> > > >> > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan > >> wrote: > >> > > >> > > The first worker "failing" is 0904-20022331. The log looks funny at > >> the > >> > > end. > >> > > > >> > > Can you git pull and re-run? The worker is getting some command at the > >> > > end there and doing nothing about it and I wonder why. > >> > > > >> > > Mihael > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > >> > > > Ok, now I have some worker logs: > >> > > > > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > >> > > > > >> > > > There's nothing obvious I see in the worker logs that would > >> indicate why > >> > > > the connection was broken. > >> > > > > >> > > > - Tim > >> > > > > >> > > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > >> tim.g.armstrong at gmail.com > >> > > > > >> > > > wrote: > >> > > > > >> > > > > This is all running locally on my laptop, so I think we can rule > >> out > >> > > 1). > >> > > > > > >> > > > > It also seems like it's a state the coaster service gets into > >> after a > >> > > few > >> > > > > client sessions: generally the first coaster run works fine, then > >> > > after a > >> > > > > few runs the problem occurs more frequently. > >> > > > > > >> > > > > I'm going to try and get worker logs, in the meantime i've got > >> some > >> > > > > jstacks (attached). > >> > > > > > >> > > > > Matching service logs (largish) are here if needed: > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > >> > > > > > >> > > > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > >> hategan at mcs.anl.gov> > >> > > > > wrote: > >> > > > > > >> > > > >> Ah, makes sense. > >> > > > >> > >> > > > >> 2 minutes is the channel timeout. Each live connection is > >> guaranteed > >> > > to > >> > > > >> have some communication for any 2 minute time window, partially > >> due to > >> > > > >> periodic heartbeats (sent every 1 minute). If no packets flow > >> for the > >> > > > >> duration of 2 minutes, the connection is assumed broken and all > >> jobs > >> > > > >> that were submitted to the respective workers are considered > >> failed. > >> > > So > >> > > > >> there seems to be an issue with the connections to some of the > >> > > workers, > >> > > > >> and it takes 2 minutes to detect them. > >> > > > >> > >> > > > >> Since the service seems to be alive (although a jstack on the > >> service > >> > > > >> when thing seem to hang might help), this leaves two > >> possibilities: > >> > > > >> 1 - some genuine network problem > >> > > > >> 2 - the worker died without properly closing TCP connections > >> > > > >> > >> > > > >> If (2), you could enable worker logging > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if > >> anything > >> > > shows > >> > > > >> up. 
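For reference, flipping the worker-logging switch mentioned above from the C++ client looks roughly like the fragment below. Only the key name comes from the message; the Settings::set() signature, the header name, and the idea that the settings object is handed to the client before submission are assumptions about the C client API, so check them against the actual headers in the coaster C client tree.

    // Rough sketch, not verified against the real client API: enable DEBUG
    // logging on the workers so that worker-side failures leave a trace.
    #include "Settings.h"      // assumed header name in the coaster C client

    void enableWorkerDebugLogging(Settings& settings) {
        // WORKER_LOGGING_LEVEL is the key quoted above; "DEBUG" is the level.
        settings.set(Settings::Key::WORKER_LOGGING_LEVEL, "DEBUG");
        // Pass `settings` to the client as usual before submitting jobs.
    }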
> >> > > > >> > >> > > > >> Mihael > >> > > > >> > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > >> > > > >> > Here are client and service logs, with part of service log > >> edited > >> > > down > >> > > > >> to > >> > > > >> > be a reasonable size (I have the full thing if needed, but it > >> was > >> > > over a > >> > > > >> > gigabyte). > >> > > > >> > > >> > > > >> > One relevant section is from 19:49:35 onwards. The client > >> submits 4 > >> > > > >> jobs > >> > > > >> > (its limit), but they don't complete until 19:51:32 or so (I > >> can see > >> > > > >> that > >> > > > >> > one task completed based on ncompleted=1 in the check_tasks log > >> > > > >> message). > >> > > > >> > It looks like something has happened with broken pipes and > >> workers > >> > > being > >> > > > >> > lost, but I'm not sure what the ultimate cause of that is > >> likely to > >> > > be. > >> > > > >> > > >> > > > >> > - Tim > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > >> hategan at mcs.anl.gov > >> > > > > >> > > > >> wrote: > >> > > > >> > > >> > > > >> > > Hi Tim, > >> > > > >> > > > >> > > > >> > > I've never seen this before with pure Java. > >> > > > >> > > > >> > > > >> > > Do you have logs from these runs? > >> > > > >> > > > >> > > > >> > > Mihael > >> > > > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > >> > > > >> > > > I'm running a test Swift/T script that submit tasks to > >> Coasters > >> > > > >> through > >> > > > >> > > the > >> > > > >> > > > C++ client and I'm seeing some odd behaviour where task > >> > > > >> > > > submission/execution is stalling for ~2 minute periods. > >> For > >> > > > >> example, I'm > >> > > > >> > > > seeing submit log messages like "submitting > >> > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in > >> bursts of > >> > > > >> several > >> > > > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. > >> I'm > >> > > seeing > >> > > > >> > > bursts > >> > > > >> > > > with the following intervals in my logs. > >> > > > >> > > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > >> > > > >> > > > > >> > > > >> > > > From what I can tell, the delay is on the coaster service > >> side: > >> > > the > >> > > > >> C > >> > > > >> > > > client is just waiting for a response. > >> > > > >> > > > > >> > > > >> > > > The jobs are just being submitted through the local job > >> > > manager, so > >> > > > >> I > >> > > > >> > > > wouldn't expect any delays there. The tasks are also just > >> > > > >> > > "/bin/hostname", > >> > > > >> > > > so should return immediately. > >> > > > >> > > > > >> > > > >> > > > I'm going to continue digging into this on my own, but the > >> 2 > >> > > minute > >> > > > >> delay > >> > > > >> > > > seems like a big clue: does anyone have an idea what could > >> cause > >> > > > >> stalls > >> > > > >> > > in > >> > > > >> > > > task submission of 2 minute duration? 
> >> > > > >> > > > > >> > > > >> > > > Cheers, > >> > > > >> > > > Tim > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > >> > > > >> > >> > > > >> > >> > > > > > >> > > > >> > > > >> > > > >> > >> > >> > > From tim.g.armstrong at gmail.com Fri Sep 5 13:09:00 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Fri, 5 Sep 2014 13:09:00 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1409939838.12288.3.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> Message-ID: Thanks, let me know if there's anything I can help do. - Tim On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan wrote: > Thanks. It also seems that there is an older bug in there in which the > client connection is not properly accounted for and things start failing > two minutes after the client connects (which is also probably why you > didn't see this in runs with many short client connections). I'm not > sure why the fix for that bug isn't in the trunk code. > > In any event, I'll set up a client submission loop and fix all these > things. > > Mihael > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > Ok, here it is with the additional debug messages. Source code change is > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > > Warning: the tarball will expand to several gigabytes of logs. > > > > I had to do multiple client runs to trigger it. It seems like the > problem > > might be triggered by abnormal termination of the client. First 18 runs > > went fine, problem only started when I ctrl-c-ed the swift/t run #19 > before > > the run #20 that exhibited delays. > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > > - Tim > > > > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong > > > wrote: > > > > > It's here: > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . > > > > > > I'll add some extra debug messages in the coaster C++ client and see > if I > > > can recreate the scenario. > > > > > > - Tim > > > > > > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan > > > wrote: > > > > > >> Ok, so that's legit. > > >> > > >> It does look like shut down workers are not properly accounted for in > > >> some places (and I believe Yadu submitted a bug for this). However, I > do > > >> not see the dead time you mention in either of the last two sets of > > >> logs. It looks like each client instance submits a continous stream of > > >> jobs. > > >> > > >> So let's get back to the initial log. Can I have the full service log? > > >> I'm trying to track what happened with the jobs submitted before the > > >> first big pause. > > >> > > >> Also, a log message in CoasterClient::updateJobStatus() (or friends) > > >> would probably help a lot here. > > >> > > >> Mihael > > >> > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > >> > Should be here: > > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > >> > > > >> > > > >> > > > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan > > > >> wrote: > > >> > > > >> > > The first worker "failing" is 0904-20022331. The log looks funny > at > > >> the > > >> > > end. > > >> > > > > >> > > Can you git pull and re-run? The worker is getting some command > at the > > >> > > end there and doing nothing about it and I wonder why. 
> > >> > > > > >> > > Mihael > > >> > > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > >> > > > Ok, now I have some worker logs: > > >> > > > > > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > >> > > > > > >> > > > There's nothing obvious I see in the worker logs that would > > >> indicate why > > >> > > > the connection was broken. > > >> > > > > > >> > > > - Tim > > >> > > > > > >> > > > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > >> tim.g.armstrong at gmail.com > > >> > > > > > >> > > > wrote: > > >> > > > > > >> > > > > This is all running locally on my laptop, so I think we can > rule > > >> out > > >> > > 1). > > >> > > > > > > >> > > > > It also seems like it's a state the coaster service gets into > > >> after a > > >> > > few > > >> > > > > client sessions: generally the first coaster run works fine, > then > > >> > > after a > > >> > > > > few runs the problem occurs more frequently. > > >> > > > > > > >> > > > > I'm going to try and get worker logs, in the meantime i've got > > >> some > > >> > > > > jstacks (attached). > > >> > > > > > > >> > > > > Matching service logs (largish) are here if needed: > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > >> > > > > > > >> > > > > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > > >> hategan at mcs.anl.gov> > > >> > > > > wrote: > > >> > > > > > > >> > > > >> Ah, makes sense. > > >> > > > >> > > >> > > > >> 2 minutes is the channel timeout. Each live connection is > > >> guaranteed > > >> > > to > > >> > > > >> have some communication for any 2 minute time window, > partially > > >> due to > > >> > > > >> periodic heartbeats (sent every 1 minute). If no packets flow > > >> for the > > >> > > > >> duration of 2 minutes, the connection is assumed broken and > all > > >> jobs > > >> > > > >> that were submitted to the respective workers are considered > > >> failed. > > >> > > So > > >> > > > >> there seems to be an issue with the connections to some of > the > > >> > > workers, > > >> > > > >> and it takes 2 minutes to detect them. > > >> > > > >> > > >> > > > >> Since the service seems to be alive (although a jstack on the > > >> service > > >> > > > >> when thing seem to hang might help), this leaves two > > >> possibilities: > > >> > > > >> 1 - some genuine network problem > > >> > > > >> 2 - the worker died without properly closing TCP connections > > >> > > > >> > > >> > > > >> If (2), you could enable worker logging > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if > > >> anything > > >> > > shows > > >> > > > >> up. > > >> > > > >> > > >> > > > >> Mihael > > >> > > > >> > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > >> > > > >> > Here are client and service logs, with part of service log > > >> edited > > >> > > down > > >> > > > >> to > > >> > > > >> > be a reasonable size (I have the full thing if needed, but > it > > >> was > > >> > > over a > > >> > > > >> > gigabyte). > > >> > > > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. The client > > >> submits 4 > > >> > > > >> jobs > > >> > > > >> > (its limit), but they don't complete until 19:51:32 or so > (I > > >> can see > > >> > > > >> that > > >> > > > >> > one task completed based on ncompleted=1 in the > check_tasks log > > >> > > > >> message). 
> > >> > > > >> > It looks like something has happened with broken pipes and > > >> workers > > >> > > being > > >> > > > >> > lost, but I'm not sure what the ultimate cause of that is > > >> likely to > > >> > > be. > > >> > > > >> > > > >> > > > >> > - Tim > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > > >> hategan at mcs.anl.gov > > >> > > > > > >> > > > >> wrote: > > >> > > > >> > > > >> > > > >> > > Hi Tim, > > >> > > > >> > > > > >> > > > >> > > I've never seen this before with pure Java. > > >> > > > >> > > > > >> > > > >> > > Do you have logs from these runs? > > >> > > > >> > > > > >> > > > >> > > Mihael > > >> > > > >> > > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > >> > > > >> > > > I'm running a test Swift/T script that submit tasks to > > >> Coasters > > >> > > > >> through > > >> > > > >> > > the > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour where task > > >> > > > >> > > > submission/execution is stalling for ~2 minute periods. > > >> For > > >> > > > >> example, I'm > > >> > > > >> > > > seeing submit log messages like "submitting > > >> > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in > > >> bursts of > > >> > > > >> several > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in between, > e.g. > > >> I'm > > >> > > seeing > > >> > > > >> > > bursts > > >> > > > >> > > > with the following intervals in my logs. > > >> > > > >> > > > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > >> > > > >> > > > > > >> > > > >> > > > From what I can tell, the delay is on the coaster > service > > >> side: > > >> > > the > > >> > > > >> C > > >> > > > >> > > > client is just waiting for a response. > > >> > > > >> > > > > > >> > > > >> > > > The jobs are just being submitted through the local job > > >> > > manager, so > > >> > > > >> I > > >> > > > >> > > > wouldn't expect any delays there. The tasks are also > just > > >> > > > >> > > "/bin/hostname", > > >> > > > >> > > > so should return immediately. > > >> > > > >> > > > > > >> > > > >> > > > I'm going to continue digging into this on my own, but > the > > >> 2 > > >> > > minute > > >> > > > >> delay > > >> > > > >> > > > seems like a big clue: does anyone have an idea what > could > > >> cause > > >> > > > >> stalls > > >> > > > >> > > in > > >> > > > >> > > > task submission of 2 minute duration? > > >> > > > >> > > > > > >> > > > >> > > > Cheers, > > >> > > > >> > > > Tim > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > > > > >> > > > > >> > > > > >> > > > > >> > > >> > > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Sat Sep 6 17:02:06 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 6 Sep 2014 15:02:06 -0700 Subject: [Swift-devel] calls for papers and our mailing lists Message-ID: <1410040926.5304.1.camel@echo> Hi, Do we want to receive CFPs on swift-user or swift-devel? 
Mihael From hategan at mcs.anl.gov Mon Sep 8 14:38:21 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 8 Sep 2014 12:38:21 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> Message-ID: <1410205101.24345.22.camel@echo> So... There were bugs. Lots of bugs. I did some work over the weekend to fix some of these and clean up the coaster code. Here's a summary: - there was some stuff in the low level coaster code to deal with persisting coaster channels over multiple connections with various options, like periodic connections, client or server initiated connections, buffering of commands, etc. None of this was used by Swift, and the code was pretty messy. I removed that. - there were some issues with multiple clients: * improper shutdown of relevant workers when a client disconnected * the worker task dispatcher was a singleton and had a reference to one block allocator, whereas multiple clients involved multiple allocators. - there were a bunch of locking issues in the C client that valgrind caught - the idea of remote job ids was a bit hard to work with. This remote id was the job id that the service assigned to a job. This is necessary because two different clients can submit jobs with the same id. The remote id would be communicated to the client as the reply to the submit request. However, it was entirely possible for a notification about job status to be sent to the client before the submit reply was. Since notifications were sent using the remote-id, the client would have no idea what job the notifications belonged to. Now, the server might need a unique job id, but there is no reason why it cannot use the client id when communicating the status to a client. So that's there now. - the way the C client was working, its jobs ended up not going to the workers, but the local queue. The service settings now allow specifying the provider/jobManager/url to be used to start blocks, and jobs are routed appropriately if they do not have the batch job flag set. I also added a shared service mode. We discussed this before. Basically you start the coaster service with "-shared " and all the settings are read from that file. In this case, all clients share the same worker pool, and client settings are ignored. The C client now has a multi-job testing tool which can submit many jobs with the desired level of concurrency. I have tested the C client with both shared and non-shared mode, with various levels of jobs being sent, with either one or two concurrent clients. I haven't tested manual workers. I've also decided that during normal operation (i.e. client connects, submits jobs, shuts down gracefully), there should be no exceptions in the coaster log. I think we should stick to that principle. This was the case last I tested, and we should consider any deviation from that to be a problem. Of course, there are some things for which there is no graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are fine in that case. So anyway, let's start from here. Mihael On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > Thanks, let me know if there's anything I can help do. > > - Tim > > > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan wrote: > > > Thanks. 
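The remote-id race described in the summary above is easier to see with a small sketch. Here is a minimal illustration in C++ with made-up names (this is not the actual CoasterClient code): the client keeps its job table keyed by the id it generated at submit time, so a status notification that arrives before the submit reply can still be matched, which is exactly what keying notifications by the client id buys.

    #include <map>
    #include <stdexcept>
    #include <string>

    enum JobStatus { SUBMITTED, ACTIVE, COMPLETED, FAILED };

    // Illustrative client-side job table. Because the service now reports
    // status using the id the client assigned, no mapping from a server-side
    // "remote id" (unknown until the submit reply arrives) is needed to route
    // notifications.
    class JobTable {
        std::map<std::string, JobStatus> jobs;   // keyed by client-assigned id
    public:
        void submitted(const std::string& clientId) {
            jobs[clientId] = SUBMITTED;
        }
        void onStatus(const std::string& clientId, JobStatus s) {
            std::map<std::string, JobStatus>::iterator it = jobs.find(clientId);
            if (it == jobs.end()) {
                throw std::runtime_error("status for unknown job " + clientId);
            }
            it->second = s;                      // works even if the submit
        }                                        // reply has not arrived yet
    };

The service is still free to keep its own unique id internally; it just translates back to the client's id when talking to that client.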
It also seems that there is an older bug in there in which the > > client connection is not properly accounted for and things start failing > > two minutes after the client connects (which is also probably why you > > didn't see this in runs with many short client connections). I'm not > > sure why the fix for that bug isn't in the trunk code. > > > > In any event, I'll set up a client submission loop and fix all these > > things. > > > > Mihael > > > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > > Ok, here it is with the additional debug messages. Source code change is > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > > > > Warning: the tarball will expand to several gigabytes of logs. > > > > > > I had to do multiple client runs to trigger it. It seems like the > > problem > > > might be triggered by abnormal termination of the client. First 18 runs > > > went fine, problem only started when I ctrl-c-ed the swift/t run #19 > > before > > > the run #20 that exhibited delays. > > > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > > > > - Tim > > > > > > > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong > > > > > wrote: > > > > > > > It's here: > > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . > > > > > > > > I'll add some extra debug messages in the coaster C++ client and see > > if I > > > > can recreate the scenario. > > > > > > > > - Tim > > > > > > > > > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan > > > > wrote: > > > > > > > >> Ok, so that's legit. > > > >> > > > >> It does look like shut down workers are not properly accounted for in > > > >> some places (and I believe Yadu submitted a bug for this). However, I > > do > > > >> not see the dead time you mention in either of the last two sets of > > > >> logs. It looks like each client instance submits a continous stream of > > > >> jobs. > > > >> > > > >> So let's get back to the initial log. Can I have the full service log? > > > >> I'm trying to track what happened with the jobs submitted before the > > > >> first big pause. > > > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or friends) > > > >> would probably help a lot here. > > > >> > > > >> Mihael > > > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > > >> > Should be here: > > > >> > > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan > > > > > >> wrote: > > > >> > > > > >> > > The first worker "failing" is 0904-20022331. The log looks funny > > at > > > >> the > > > >> > > end. > > > >> > > > > > >> > > Can you git pull and re-run? The worker is getting some command > > at the > > > >> > > end there and doing nothing about it and I wonder why. > > > >> > > > > > >> > > Mihael > > > >> > > > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > > >> > > > Ok, now I have some worker logs: > > > >> > > > > > > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > >> > > > > > > >> > > > There's nothing obvious I see in the worker logs that would > > > >> indicate why > > > >> > > > the connection was broken. 
> > > >> > > > > > > >> > > > - Tim > > > >> > > > > > > >> > > > > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > > >> tim.g.armstrong at gmail.com > > > >> > > > > > > >> > > > wrote: > > > >> > > > > > > >> > > > > This is all running locally on my laptop, so I think we can > > rule > > > >> out > > > >> > > 1). > > > >> > > > > > > > >> > > > > It also seems like it's a state the coaster service gets into > > > >> after a > > > >> > > few > > > >> > > > > client sessions: generally the first coaster run works fine, > > then > > > >> > > after a > > > >> > > > > few runs the problem occurs more frequently. > > > >> > > > > > > > >> > > > > I'm going to try and get worker logs, in the meantime i've got > > > >> some > > > >> > > > > jstacks (attached). > > > >> > > > > > > > >> > > > > Matching service logs (largish) are here if needed: > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > >> > > > > > > > >> > > > > > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > > > >> hategan at mcs.anl.gov> > > > >> > > > > wrote: > > > >> > > > > > > > >> > > > >> Ah, makes sense. > > > >> > > > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live connection is > > > >> guaranteed > > > >> > > to > > > >> > > > >> have some communication for any 2 minute time window, > > partially > > > >> due to > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no packets flow > > > >> for the > > > >> > > > >> duration of 2 minutes, the connection is assumed broken and > > all > > > >> jobs > > > >> > > > >> that were submitted to the respective workers are considered > > > >> failed. > > > >> > > So > > > >> > > > >> there seems to be an issue with the connections to some of > > the > > > >> > > workers, > > > >> > > > >> and it takes 2 minutes to detect them. > > > >> > > > >> > > > >> > > > >> Since the service seems to be alive (although a jstack on the > > > >> service > > > >> > > > >> when thing seem to hang might help), this leaves two > > > >> possibilities: > > > >> > > > >> 1 - some genuine network problem > > > >> > > > >> 2 - the worker died without properly closing TCP connections > > > >> > > > >> > > > >> > > > >> If (2), you could enable worker logging > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if > > > >> anything > > > >> > > shows > > > >> > > > >> up. > > > >> > > > >> > > > >> > > > >> Mihael > > > >> > > > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > > >> > > > >> > Here are client and service logs, with part of service log > > > >> edited > > > >> > > down > > > >> > > > >> to > > > >> > > > >> > be a reasonable size (I have the full thing if needed, but > > it > > > >> was > > > >> > > over a > > > >> > > > >> > gigabyte). > > > >> > > > >> > > > > >> > > > >> > One relevant section is from 19:49:35 onwards. The client > > > >> submits 4 > > > >> > > > >> jobs > > > >> > > > >> > (its limit), but they don't complete until 19:51:32 or so > > (I > > > >> can see > > > >> > > > >> that > > > >> > > > >> > one task completed based on ncompleted=1 in the > > check_tasks log > > > >> > > > >> message). > > > >> > > > >> > It looks like something has happened with broken pipes and > > > >> workers > > > >> > > being > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of that is > > > >> likely to > > > >> > > be. 
> > > >> > > > >> > > > > >> > > > >> > - Tim > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > > > >> hategan at mcs.anl.gov > > > >> > > > > > > >> > > > >> wrote: > > > >> > > > >> > > > > >> > > > >> > > Hi Tim, > > > >> > > > >> > > > > > >> > > > >> > > I've never seen this before with pure Java. > > > >> > > > >> > > > > > >> > > > >> > > Do you have logs from these runs? > > > >> > > > >> > > > > > >> > > > >> > > Mihael > > > >> > > > >> > > > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > > >> > > > >> > > > I'm running a test Swift/T script that submit tasks to > > > >> Coasters > > > >> > > > >> through > > > >> > > > >> > > the > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour where task > > > >> > > > >> > > > submission/execution is stalling for ~2 minute periods. > > > >> For > > > >> > > > >> example, I'm > > > >> > > > >> > > > seeing submit log messages like "submitting > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in > > > >> bursts of > > > >> > > > >> several > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in between, > > e.g. > > > >> I'm > > > >> > > seeing > > > >> > > > >> > > bursts > > > >> > > > >> > > > with the following intervals in my logs. > > > >> > > > >> > > > > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > >> > > > >> > > > > > > >> > > > >> > > > From what I can tell, the delay is on the coaster > > service > > > >> side: > > > >> > > the > > > >> > > > >> C > > > >> > > > >> > > > client is just waiting for a response. > > > >> > > > >> > > > > > > >> > > > >> > > > The jobs are just being submitted through the local job > > > >> > > manager, so > > > >> > > > >> I > > > >> > > > >> > > > wouldn't expect any delays there. The tasks are also > > just > > > >> > > > >> > > "/bin/hostname", > > > >> > > > >> > > > so should return immediately. > > > >> > > > >> > > > > > > >> > > > >> > > > I'm going to continue digging into this on my own, but > > the > > > >> 2 > > > >> > > minute > > > >> > > > >> delay > > > >> > > > >> > > > seems like a big clue: does anyone have an idea what > > could > > > >> cause > > > >> > > > >> stalls > > > >> > > > >> > > in > > > >> > > > >> > > > task submission of 2 minute duration? > > > >> > > > >> > > > > > > >> > > > >> > > > Cheers, > > > >> > > > >> > > > Tim > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > > > > > > > > From tim.g.armstrong at gmail.com Thu Sep 11 10:30:05 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 10:30:05 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410205101.24345.22.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> Message-ID: This all sounds great. Just to check that I've understood correctly, from the client's point of view: * The per-client settings behave the same if -shared is not provided. 
* Per-client settings are ignored if -shared is provided I had one question: * Do automatically allocated workers work with per-client settings? I understand there were some issues related to sharing workers between clients. Was the solution to have separate worker pools, or is this just not supported? - Tim On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan wrote: > So... > > There were bugs. Lots of bugs. > I did some work over the weekend to fix some of these and clean up the > coaster code. Here's a summary: > > - there was some stuff in the low level coaster code to deal with > persisting coaster channels over multiple connections with various > options, like periodic connections, client or server initiated > connections, buffering of commands, etc. None of this was used by Swift, > and the code was pretty messy. I removed that. > - there were some issues with multiple clients: > * improper shutdown of relevant workers when a client disconnected > * the worker task dispatcher was a singleton and had a reference to > one block allocator, whereas multiple clients involved multiple > allocators. > - there were a bunch of locking issues in the C client that valgrind > caught > - the idea of remote job ids was a bit hard to work with. This remote id > was the job id that the service assigned to a job. This is necessary > because two different clients can submit jobs with the same id. The > remote id would be communicated to the client as the reply to the submit > request. However, it was entirely possible for a notification about job > status to be sent to the client before the submit reply was. Since > notifications were sent using the remote-id, the client would have no > idea what job the notifications belonged to. Now, the server might need > a unique job id, but there is no reason why it cannot use the client id > when communicating the status to a client. So that's there now. > - the way the C client was working, its jobs ended up not going to the > workers, but the local queue. The service settings now allow specifying > the provider/jobManager/url to be used to start blocks, and jobs are > routed appropriately if they do not have the batch job flag set. > > I also added a shared service mode. We discussed this before. Basically > you start the coaster service with "-shared " and > all the settings are read from that file. In this case, all clients > share the same worker pool, and client settings are ignored. > > The C client now has a multi-job testing tool which can submit many jobs > with the desired level of concurrency. > > I have tested the C client with both shared and non-shared mode, with > various levels of jobs being sent, with either one or two concurrent > clients. > > I haven't tested manual workers. > > I've also decided that during normal operation (i.e. client connects, > submits jobs, shuts down gracefully), there should be no exceptions in > the coaster log. I think we should stick to that principle. This was the > case last I tested, and we should consider any deviation from that to be > a problem. Of course, there are some things for which there is no > graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are > fine in that case. > > So anyway, let's start from here. > > Mihael > > On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > Thanks, let me know if there's anything I can help do. > > > > - Tim > > > > > > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan > wrote: > > > > > Thanks. 
It also seems that there is an older bug in there in which the > > > client connection is not properly accounted for and things start > failing > > > two minutes after the client connects (which is also probably why you > > > didn't see this in runs with many short client connections). I'm not > > > sure why the fix for that bug isn't in the trunk code. > > > > > > In any event, I'll set up a client submission loop and fix all these > > > things. > > > > > > Mihael > > > > > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > > > Ok, here it is with the additional debug messages. Source code > change is > > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > > > > > > Warning: the tarball will expand to several gigabytes of logs. > > > > > > > > I had to do multiple client runs to trigger it. It seems like the > > > problem > > > > might be triggered by abnormal termination of the client. First 18 > runs > > > > went fine, problem only started when I ctrl-c-ed the swift/t run #19 > > > before > > > > the run #20 that exhibited delays. > > > > > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > > > > > > - Tim > > > > > > > > > > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > tim.g.armstrong at gmail.com > > > > > > > > wrote: > > > > > > > > > It's here: > > > > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . > > > > > > > > > > I'll add some extra debug messages in the coaster C++ client and > see > > > if I > > > > > can recreate the scenario. > > > > > > > > > > - Tim > > > > > > > > > > > > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > hategan at mcs.anl.gov> > > > > > wrote: > > > > > > > > > >> Ok, so that's legit. > > > > >> > > > > >> It does look like shut down workers are not properly accounted > for in > > > > >> some places (and I believe Yadu submitted a bug for this). > However, I > > > do > > > > >> not see the dead time you mention in either of the last two sets > of > > > > >> logs. It looks like each client instance submits a continous > stream of > > > > >> jobs. > > > > >> > > > > >> So let's get back to the initial log. Can I have the full service > log? > > > > >> I'm trying to track what happened with the jobs submitted before > the > > > > >> first big pause. > > > > >> > > > > >> Also, a log message in CoasterClient::updateJobStatus() (or > friends) > > > > >> would probably help a lot here. > > > > >> > > > > >> Mihael > > > > >> > > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > > > >> > Should be here: > > > > >> > > > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > hategan at mcs.anl.gov > > > > > > > > >> wrote: > > > > >> > > > > > >> > > The first worker "failing" is 0904-20022331. The log looks > funny > > > at > > > > >> the > > > > >> > > end. > > > > >> > > > > > > >> > > Can you git pull and re-run? The worker is getting some > command > > > at the > > > > >> > > end there and doing nothing about it and I wonder why. 
> > > > >> > > > > > > >> > > Mihael > > > > >> > > > > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > > > >> > > > Ok, now I have some worker logs: > > > > >> > > > > > > > >> > > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > >> > > > > > > > >> > > > There's nothing obvious I see in the worker logs that would > > > > >> indicate why > > > > >> > > > the connection was broken. > > > > >> > > > > > > > >> > > > - Tim > > > > >> > > > > > > > >> > > > > > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > > > >> tim.g.armstrong at gmail.com > > > > >> > > > > > > > >> > > > wrote: > > > > >> > > > > > > > >> > > > > This is all running locally on my laptop, so I think we > can > > > rule > > > > >> out > > > > >> > > 1). > > > > >> > > > > > > > > >> > > > > It also seems like it's a state the coaster service gets > into > > > > >> after a > > > > >> > > few > > > > >> > > > > client sessions: generally the first coaster run works > fine, > > > then > > > > >> > > after a > > > > >> > > > > few runs the problem occurs more frequently. > > > > >> > > > > > > > > >> > > > > I'm going to try and get worker logs, in the meantime > i've got > > > > >> some > > > > >> > > > > jstacks (attached). > > > > >> > > > > > > > > >> > > > > Matching service logs (largish) are here if needed: > > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > > > > >> hategan at mcs.anl.gov> > > > > >> > > > > wrote: > > > > >> > > > > > > > > >> > > > >> Ah, makes sense. > > > > >> > > > >> > > > > >> > > > >> 2 minutes is the channel timeout. Each live connection is > > > > >> guaranteed > > > > >> > > to > > > > >> > > > >> have some communication for any 2 minute time window, > > > partially > > > > >> due to > > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no packets > flow > > > > >> for the > > > > >> > > > >> duration of 2 minutes, the connection is assumed broken > and > > > all > > > > >> jobs > > > > >> > > > >> that were submitted to the respective workers are > considered > > > > >> failed. > > > > >> > > So > > > > >> > > > >> there seems to be an issue with the connections to some > of > > > the > > > > >> > > workers, > > > > >> > > > >> and it takes 2 minutes to detect them. > > > > >> > > > >> > > > > >> > > > >> Since the service seems to be alive (although a jstack > on the > > > > >> service > > > > >> > > > >> when thing seem to hang might help), this leaves two > > > > >> possibilities: > > > > >> > > > >> 1 - some genuine network problem > > > > >> > > > >> 2 - the worker died without properly closing TCP > connections > > > > >> > > > >> > > > > >> > > > >> If (2), you could enable worker logging > > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if > > > > >> anything > > > > >> > > shows > > > > >> > > > >> up. > > > > >> > > > >> > > > > >> > > > >> Mihael > > > > >> > > > >> > > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > > > >> > > > >> > Here are client and service logs, with part of service > log > > > > >> edited > > > > >> > > down > > > > >> > > > >> to > > > > >> > > > >> > be a reasonable size (I have the full thing if needed, > but > > > it > > > > >> was > > > > >> > > over a > > > > >> > > > >> > gigabyte). > > > > >> > > > >> > > > > > >> > > > >> > One relevant section is from 19:49:35 onwards. 
The > client > > > > >> submits 4 > > > > >> > > > >> jobs > > > > >> > > > >> > (its limit), but they don't complete until 19:51:32 or > so > > > (I > > > > >> can see > > > > >> > > > >> that > > > > >> > > > >> > one task completed based on ncompleted=1 in the > > > check_tasks log > > > > >> > > > >> message). > > > > >> > > > >> > It looks like something has happened with broken pipes > and > > > > >> workers > > > > >> > > being > > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of that > is > > > > >> likely to > > > > >> > > be. > > > > >> > > > >> > > > > > >> > > > >> > - Tim > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > > > > >> hategan at mcs.anl.gov > > > > >> > > > > > > > >> > > > >> wrote: > > > > >> > > > >> > > > > > >> > > > >> > > Hi Tim, > > > > >> > > > >> > > > > > > >> > > > >> > > I've never seen this before with pure Java. > > > > >> > > > >> > > > > > > >> > > > >> > > Do you have logs from these runs? > > > > >> > > > >> > > > > > > >> > > > >> > > Mihael > > > > >> > > > >> > > > > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong > wrote: > > > > >> > > > >> > > > I'm running a test Swift/T script that submit > tasks to > > > > >> Coasters > > > > >> > > > >> through > > > > >> > > > >> > > the > > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour where > task > > > > >> > > > >> > > > submission/execution is stalling for ~2 minute > periods. > > > > >> For > > > > >> > > > >> example, I'm > > > > >> > > > >> > > > seeing submit log messages like "submitting > > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > /bin/hostname" in > > > > >> bursts of > > > > >> > > > >> several > > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in between, > > > e.g. > > > > >> I'm > > > > >> > > seeing > > > > >> > > > >> > > bursts > > > > >> > > > >> > > > with the following intervals in my logs. > > > > >> > > > >> > > > > > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > > >> > > > >> > > > > > > > >> > > > >> > > > From what I can tell, the delay is on the coaster > > > service > > > > >> side: > > > > >> > > the > > > > >> > > > >> C > > > > >> > > > >> > > > client is just waiting for a response. > > > > >> > > > >> > > > > > > > >> > > > >> > > > The jobs are just being submitted through the > local job > > > > >> > > manager, so > > > > >> > > > >> I > > > > >> > > > >> > > > wouldn't expect any delays there. The tasks are > also > > > just > > > > >> > > > >> > > "/bin/hostname", > > > > >> > > > >> > > > so should return immediately. > > > > >> > > > >> > > > > > > > >> > > > >> > > > I'm going to continue digging into this on my own, > but > > > the > > > > >> 2 > > > > >> > > minute > > > > >> > > > >> delay > > > > >> > > > >> > > > seems like a big clue: does anyone have an idea > what > > > could > > > > >> cause > > > > >> > > > >> stalls > > > > >> > > > >> > > in > > > > >> > > > >> > > > task submission of 2 minute duration? 
> > > > >> > > > >> > > > > > > > >> > > > >> > > > Cheers, > > > > >> > > > >> > > > Tim > > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > >> > > > > >> > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tim.g.armstrong at gmail.com Thu Sep 11 12:16:30 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 12:16:30 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> Message-ID: I'm seeing failures when running Swift/T tests with start-coaster-service.sh. E.g. the turbine test coaster-exec-1. I can provide instructions for running the test if needed (roughly, you need to build Swift/T with coaster support enabled, then make tests/coaster-exec-1.result in the turbine directory). The github swift-t release is up to date if you want to use that. Full log is attached, stack trace excerpt is below. - Tim 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... id=0911-1112130 Using threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242 ] 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel Using threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242] org.globus.cog.coaster.channels.ChannelException: Invalid channel: null @id://null-nullS at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) at org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) provider=local 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler provider=local org.globus.cog.coaster.channels.ChannelException: Invalid channel: null @id://null-nullS at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) at org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) Handler(tag: 38907, SUBMITJOB) 
sending error: Could not deserialize job description org.globus.cog.coaster.ProtocolException: Could not deserialize job description at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid channel: null at id://null-nullS at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) at org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) ... 4 more 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job description org.globus.cog.coaster.ProtocolException: Could not deserialize job description at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid channel: null at id://null-nullS at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) at org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) ... 4 more On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong wrote: > This all sounds great. > > Just to check that I've understood correctly, from the client's point of > view: > * The per-client settings behave the same if -shared is not provided. > * Per-client settings are ignored if -shared is provided > > I had one question: > * Do automatically allocated workers work with per-client settings? I > understand there were some issues related to sharing workers between > clients. Was the solution to have separate worker pools, or is this just > not supported? > > - Tim > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan > wrote: > >> So... >> >> There were bugs. Lots of bugs. >> I did some work over the weekend to fix some of these and clean up the >> coaster code. Here's a summary: >> >> - there was some stuff in the low level coaster code to deal with >> persisting coaster channels over multiple connections with various >> options, like periodic connections, client or server initiated >> connections, buffering of commands, etc. None of this was used by Swift, >> and the code was pretty messy. I removed that. 
>> - there were some issues with multiple clients: >> * improper shutdown of relevant workers when a client disconnected >> * the worker task dispatcher was a singleton and had a reference to >> one block allocator, whereas multiple clients involved multiple >> allocators. >> - there were a bunch of locking issues in the C client that valgrind >> caught >> - the idea of remote job ids was a bit hard to work with. This remote id >> was the job id that the service assigned to a job. This is necessary >> because two different clients can submit jobs with the same id. The >> remote id would be communicated to the client as the reply to the submit >> request. However, it was entirely possible for a notification about job >> status to be sent to the client before the submit reply was. Since >> notifications were sent using the remote-id, the client would have no >> idea what job the notifications belonged to. Now, the server might need >> a unique job id, but there is no reason why it cannot use the client id >> when communicating the status to a client. So that's there now. >> - the way the C client was working, its jobs ended up not going to the >> workers, but the local queue. The service settings now allow specifying >> the provider/jobManager/url to be used to start blocks, and jobs are >> routed appropriately if they do not have the batch job flag set. >> >> I also added a shared service mode. We discussed this before. Basically >> you start the coaster service with "-shared " and >> all the settings are read from that file. In this case, all clients >> share the same worker pool, and client settings are ignored. >> >> The C client now has a multi-job testing tool which can submit many jobs >> with the desired level of concurrency. >> >> I have tested the C client with both shared and non-shared mode, with >> various levels of jobs being sent, with either one or two concurrent >> clients. >> >> I haven't tested manual workers. >> >> I've also decided that during normal operation (i.e. client connects, >> submits jobs, shuts down gracefully), there should be no exceptions in >> the coaster log. I think we should stick to that principle. This was the >> case last I tested, and we should consider any deviation from that to be >> a problem. Of course, there are some things for which there is no >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are >> fine in that case. >> >> So anyway, let's start from here. >> >> Mihael >> >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: >> > Thanks, let me know if there's anything I can help do. >> > >> > - Tim >> > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan >> wrote: >> > >> > > Thanks. It also seems that there is an older bug in there in which the >> > > client connection is not properly accounted for and things start >> failing >> > > two minutes after the client connects (which is also probably why you >> > > didn't see this in runs with many short client connections). I'm not >> > > sure why the fix for that bug isn't in the trunk code. >> > > >> > > In any event, I'll set up a client submission loop and fix all these >> > > things. >> > > >> > > Mihael >> > > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: >> > > > Ok, here it is with the additional debug messages. Source code >> change is >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. >> > > > >> > > > Warning: the tarball will expand to several gigabytes of logs. >> > > > >> > > > I had to do multiple client runs to trigger it. 
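One of the fixes in the summary quoted above is that the worker task dispatcher used to be a singleton holding a single block allocator, while without -shared each client needs its own allocator and with -shared everyone uses one pool whose settings come from the shared settings file. Purely as a sketch of that bookkeeping, with class and method names invented for illustration rather than taken from the real coaster code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustration only: one allocator per client in the default mode,
    // a single shared allocator when the service runs with -shared.
    public class AllocatorRegistry {
        // Stand-in for a block allocator; not a coaster class.
        public static class BlockAllocator {
            private final String settingsId;
            public BlockAllocator(String settingsId) { this.settingsId = settingsId; }
            public String getSettingsId() { return settingsId; }
        }

        private final boolean shared;
        private final BlockAllocator sharedAllocator;
        private final Map<String, BlockAllocator> perClient = new ConcurrentHashMap<>();

        public AllocatorRegistry(boolean shared, String sharedSettingsFile) {
            this.shared = shared;
            this.sharedAllocator = shared ? new BlockAllocator(sharedSettingsFile) : null;
        }

        // Which allocator a job from this client should be routed to.
        public BlockAllocator allocatorFor(String clientId, String clientSettingsId) {
            if (shared) {
                // -shared: one worker pool, per-client settings ignored
                return sharedAllocator;
            }
            return perClient.computeIfAbsent(clientId, id -> new BlockAllocator(clientSettingsId));
        }

        // Called when a client disconnects so its workers can be shut down.
        public BlockAllocator removeClient(String clientId) {
            return shared ? null : perClient.remove(clientId);
        }
    }

The point of the sketch is the lookup key: per client id in the default mode, a single shared instance in -shared mode, and an explicit removal path so a disconnecting client's workers can be cleaned up.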
It seems like the >> > > problem >> > > > might be triggered by abnormal termination of the client. First 18 >> runs >> > > > went fine, problem only started when I ctrl-c-ed the swift/t run #19 >> > > before >> > > > the run #20 that exhibited delays. >> > > > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz >> > > > >> > > > - Tim >> > > > >> > > > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < >> tim.g.armstrong at gmail.com >> > > > >> > > > wrote: >> > > > >> > > > > It's here: >> > > > > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . >> > > > > >> > > > > I'll add some extra debug messages in the coaster C++ client and >> see >> > > if I >> > > > > can recreate the scenario. >> > > > > >> > > > > - Tim >> > > > > >> > > > > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < >> hategan at mcs.anl.gov> >> > > > > wrote: >> > > > > >> > > > >> Ok, so that's legit. >> > > > >> >> > > > >> It does look like shut down workers are not properly accounted >> for in >> > > > >> some places (and I believe Yadu submitted a bug for this). >> However, I >> > > do >> > > > >> not see the dead time you mention in either of the last two sets >> of >> > > > >> logs. It looks like each client instance submits a continous >> stream of >> > > > >> jobs. >> > > > >> >> > > > >> So let's get back to the initial log. Can I have the full >> service log? >> > > > >> I'm trying to track what happened with the jobs submitted before >> the >> > > > >> first big pause. >> > > > >> >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or >> friends) >> > > > >> would probably help a lot here. >> > > > >> >> > > > >> Mihael >> > > > >> >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: >> > > > >> > Should be here: >> > > > >> > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz >> > > > >> > >> > > > >> > >> > > > >> > >> > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < >> hategan at mcs.anl.gov >> > > > >> > > > >> wrote: >> > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. The log looks >> funny >> > > at >> > > > >> the >> > > > >> > > end. >> > > > >> > > >> > > > >> > > Can you git pull and re-run? The worker is getting some >> command >> > > at the >> > > > >> > > end there and doing nothing about it and I wonder why. >> > > > >> > > >> > > > >> > > Mihael >> > > > >> > > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: >> > > > >> > > > Ok, now I have some worker logs: >> > > > >> > > > >> > > > >> > > > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz >> > > > >> > > > >> > > > >> > > > There's nothing obvious I see in the worker logs that would >> > > > >> indicate why >> > > > >> > > > the connection was broken. >> > > > >> > > > >> > > > >> > > > - Tim >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < >> > > > >> tim.g.armstrong at gmail.com >> > > > >> > > > >> > > > >> > > > wrote: >> > > > >> > > > >> > > > >> > > > > This is all running locally on my laptop, so I think we >> can >> > > rule >> > > > >> out >> > > > >> > > 1). 
>> > > > >> > > > > >> > > > >> > > > > It also seems like it's a state the coaster service gets >> into >> > > > >> after a >> > > > >> > > few >> > > > >> > > > > client sessions: generally the first coaster run works >> fine, >> > > then >> > > > >> > > after a >> > > > >> > > > > few runs the problem occurs more frequently. >> > > > >> > > > > >> > > > >> > > > > I'm going to try and get worker logs, in the meantime >> i've got >> > > > >> some >> > > > >> > > > > jstacks (attached). >> > > > >> > > > > >> > > > >> > > > > Matching service logs (largish) are here if needed: >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < >> > > > >> hategan at mcs.anl.gov> >> > > > >> > > > > wrote: >> > > > >> > > > > >> > > > >> > > > >> Ah, makes sense. >> > > > >> > > > >> >> > > > >> > > > >> 2 minutes is the channel timeout. Each live connection >> is >> > > > >> guaranteed >> > > > >> > > to >> > > > >> > > > >> have some communication for any 2 minute time window, >> > > partially >> > > > >> due to >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no >> packets flow >> > > > >> for the >> > > > >> > > > >> duration of 2 minutes, the connection is assumed broken >> and >> > > all >> > > > >> jobs >> > > > >> > > > >> that were submitted to the respective workers are >> considered >> > > > >> failed. >> > > > >> > > So >> > > > >> > > > >> there seems to be an issue with the connections to some >> of >> > > the >> > > > >> > > workers, >> > > > >> > > > >> and it takes 2 minutes to detect them. >> > > > >> > > > >> >> > > > >> > > > >> Since the service seems to be alive (although a jstack >> on the >> > > > >> service >> > > > >> > > > >> when thing seem to hang might help), this leaves two >> > > > >> possibilities: >> > > > >> > > > >> 1 - some genuine network problem >> > > > >> > > > >> 2 - the worker died without properly closing TCP >> connections >> > > > >> > > > >> >> > > > >> > > > >> If (2), you could enable worker logging >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see >> if >> > > > >> anything >> > > > >> > > shows >> > > > >> > > > >> up. >> > > > >> > > > >> >> > > > >> > > > >> Mihael >> > > > >> > > > >> >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: >> > > > >> > > > >> > Here are client and service logs, with part of >> service log >> > > > >> edited >> > > > >> > > down >> > > > >> > > > >> to >> > > > >> > > > >> > be a reasonable size (I have the full thing if >> needed, but >> > > it >> > > > >> was >> > > > >> > > over a >> > > > >> > > > >> > gigabyte). >> > > > >> > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. The >> client >> > > > >> submits 4 >> > > > >> > > > >> jobs >> > > > >> > > > >> > (its limit), but they don't complete until 19:51:32 >> or so >> > > (I >> > > > >> can see >> > > > >> > > > >> that >> > > > >> > > > >> > one task completed based on ncompleted=1 in the >> > > check_tasks log >> > > > >> > > > >> message). >> > > > >> > > > >> > It looks like something has happened with broken >> pipes and >> > > > >> workers >> > > > >> > > being >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of >> that is >> > > > >> likely to >> > > > >> > > be. 
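To make the timeout arithmetic quoted above concrete: heartbeats go out every minute, and a connection with no traffic for two minutes is assumed broken, at which point jobs on the corresponding workers are failed. The following is a schematic of that liveness rule only, not the actual coaster channel code:

    import java.util.concurrent.TimeUnit;

    // Schematic of the rule described above: heartbeat every minute,
    // declare the connection broken after two minutes of silence.
    public class ChannelLiveness {
        private static final long HEARTBEAT_INTERVAL_MS = TimeUnit.MINUTES.toMillis(1);
        private static final long CHANNEL_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(2);

        private volatile long lastTrafficTime = System.currentTimeMillis();

        // Any packet, including a heartbeat, refreshes the timestamp.
        public void packetReceived() {
            lastTrafficTime = System.currentTimeMillis();
        }

        public boolean heartbeatDue(long now) {
            return now - lastTrafficTime >= HEARTBEAT_INTERVAL_MS;
        }

        // If nothing has flowed for the full timeout, the connection is
        // assumed broken and jobs on the corresponding workers are failed.
        public boolean isBroken(long now) {
            return now - lastTrafficTime >= CHANNEL_TIMEOUT_MS;
        }
    }

This is also why a worker that dies without closing its TCP connection shows up as a roughly two-minute stall: nothing is marked failed until the timeout check trips.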
>> > > > >> > > > >> > >> > > > >> > > > >> > - Tim >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < >> > > > >> hategan at mcs.anl.gov >> > > > >> > > > >> > > > >> > > > >> wrote: >> > > > >> > > > >> > >> > > > >> > > > >> > > Hi Tim, >> > > > >> > > > >> > > >> > > > >> > > > >> > > I've never seen this before with pure Java. >> > > > >> > > > >> > > >> > > > >> > > > >> > > Do you have logs from these runs? >> > > > >> > > > >> > > >> > > > >> > > > >> > > Mihael >> > > > >> > > > >> > > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong >> wrote: >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit >> tasks to >> > > > >> Coasters >> > > > >> > > > >> through >> > > > >> > > > >> > > the >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour >> where task >> > > > >> > > > >> > > > submission/execution is stalling for ~2 minute >> periods. >> > > > >> For >> > > > >> > > > >> example, I'm >> > > > >> > > > >> > > > seeing submit log messages like "submitting >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: >> /bin/hostname" in >> > > > >> bursts of >> > > > >> > > > >> several >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in >> between, >> > > e.g. >> > > > >> I'm >> > > > >> > > seeing >> > > > >> > > > >> > > bursts >> > > > >> > > > >> > > > with the following intervals in my logs. >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > From what I can tell, the delay is on the coaster >> > > service >> > > > >> side: >> > > > >> > > the >> > > > >> > > > >> C >> > > > >> > > > >> > > > client is just waiting for a response. >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > The jobs are just being submitted through the >> local job >> > > > >> > > manager, so >> > > > >> > > > >> I >> > > > >> > > > >> > > > wouldn't expect any delays there. The tasks are >> also >> > > just >> > > > >> > > > >> > > "/bin/hostname", >> > > > >> > > > >> > > > so should return immediately. >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > I'm going to continue digging into this on my >> own, but >> > > the >> > > > >> 2 >> > > > >> > > minute >> > > > >> > > > >> delay >> > > > >> > > > >> > > > seems like a big clue: does anyone have an idea >> what >> > > could >> > > > >> cause >> > > > >> > > > >> stalls >> > > > >> > > > >> > > in >> > > > >> > > > >> > > > task submission of 2 minute duration? >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > Cheers, >> > > > >> > > > >> > > > Tim >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > >> > > > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> >> > > > >> >> > > > >> >> > > > > >> > > >> > > >> > > >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: start-coaster-service.log.gz Type: application/x-gzip Size: 16371 bytes Desc: not available URL: From hategan at mcs.anl.gov Thu Sep 11 12:37:32 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 10:37:32 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> Message-ID: <1410457052.25856.3.camel@echo> On Thu, 2014-09-11 at 10:30 -0500, Tim Armstrong wrote: > This all sounds great. > > Just to check that I've understood correctly, from the client's point of > view: > * The per-client settings behave the same if -shared is not provided. Yes. > * Per-client settings are ignored if -shared is provided Yes. You need to send the init command though to get a config id. > > I had one question: > * Do automatically allocated workers work with per-client settings? It was supposed to work before and it is now (according to my testing). > I understand there were some issues related to sharing workers between > clients. Was the solution to have separate worker pools, or is this just > not supported? With -shared there is one worker pool and one set of settings. Without -shared, each client gets a worker pool (with its own settings). The issue wasn't conceptual. Just poor code-writing on my part. Mihael From hategan at mcs.anl.gov Thu Sep 11 12:39:33 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 10:39:33 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> Message-ID: <1410457173.25856.5.camel@echo> The method "getMetaChannel()" has been removed. Where did you get the code from? Mihael On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > I'm seeing failures when running Swift/T tests with > start-coaster-service.sh. > > E.g. the turbine test coaster-exec-1. I can provide instructions for > running the test if needed (roughly, you need to build Swift/T with coaster > support enabled, then make tests/coaster-exec-1.result in the turbine > directory). The github swift-t release is up to date if you want to use > that. > > Full log is attached, stack trace excerpt is below. > > - Tim > > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
> id=0911-1112130 > Using threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242 > ] > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel Using > threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242] > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > @id://null-nullS > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > at > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > at > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > at > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > at > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > at > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > at > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > provider=local > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler provider=local > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > @id://null-nullS > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > at > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > at > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > at > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > at > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job > description > org.globus.cog.coaster.ProtocolException: Could not deserialize job > description > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > at > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > at > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > at > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > channel: null at id://null-nullS > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > at > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > ... 
4 more > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, > SUBMITJOB) sending error: Could not deserialize job description > org.globus.cog.coaster.ProtocolException: Could not deserialize job > description > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > at > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > at > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > at > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > channel: null at id://null-nullS > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > at > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > ... 4 more > > > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong > wrote: > > > This all sounds great. > > > > Just to check that I've understood correctly, from the client's point of > > view: > > * The per-client settings behave the same if -shared is not provided. > > * Per-client settings are ignored if -shared is provided > > > > I had one question: > > * Do automatically allocated workers work with per-client settings? I > > understand there were some issues related to sharing workers between > > clients. Was the solution to have separate worker pools, or is this just > > not supported? > > > > - Tim > > > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan > > wrote: > > > >> So... > >> > >> There were bugs. Lots of bugs. > >> I did some work over the weekend to fix some of these and clean up the > >> coaster code. Here's a summary: > >> > >> - there was some stuff in the low level coaster code to deal with > >> persisting coaster channels over multiple connections with various > >> options, like periodic connections, client or server initiated > >> connections, buffering of commands, etc. None of this was used by Swift, > >> and the code was pretty messy. I removed that. > >> - there were some issues with multiple clients: > >> * improper shutdown of relevant workers when a client disconnected > >> * the worker task dispatcher was a singleton and had a reference to > >> one block allocator, whereas multiple clients involved multiple > >> allocators. > >> - there were a bunch of locking issues in the C client that valgrind > >> caught > >> - the idea of remote job ids was a bit hard to work with. This remote id > >> was the job id that the service assigned to a job. This is necessary > >> because two different clients can submit jobs with the same id. The > >> remote id would be communicated to the client as the reply to the submit > >> request. However, it was entirely possible for a notification about job > >> status to be sent to the client before the submit reply was. Since > >> notifications were sent using the remote-id, the client would have no > >> idea what job the notifications belonged to. Now, the server might need > >> a unique job id, but there is no reason why it cannot use the client id > >> when communicating the status to a client. So that's there now. 
> >> - the way the C client was working, its jobs ended up not going to the > >> workers, but the local queue. The service settings now allow specifying > >> the provider/jobManager/url to be used to start blocks, and jobs are > >> routed appropriately if they do not have the batch job flag set. > >> > >> I also added a shared service mode. We discussed this before. Basically > >> you start the coaster service with "-shared " and > >> all the settings are read from that file. In this case, all clients > >> share the same worker pool, and client settings are ignored. > >> > >> The C client now has a multi-job testing tool which can submit many jobs > >> with the desired level of concurrency. > >> > >> I have tested the C client with both shared and non-shared mode, with > >> various levels of jobs being sent, with either one or two concurrent > >> clients. > >> > >> I haven't tested manual workers. > >> > >> I've also decided that during normal operation (i.e. client connects, > >> submits jobs, shuts down gracefully), there should be no exceptions in > >> the coaster log. I think we should stick to that principle. This was the > >> case last I tested, and we should consider any deviation from that to be > >> a problem. Of course, there are some things for which there is no > >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are > >> fine in that case. > >> > >> So anyway, let's start from here. > >> > >> Mihael > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > >> > Thanks, let me know if there's anything I can help do. > >> > > >> > - Tim > >> > > >> > > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan > >> wrote: > >> > > >> > > Thanks. It also seems that there is an older bug in there in which the > >> > > client connection is not properly accounted for and things start > >> failing > >> > > two minutes after the client connects (which is also probably why you > >> > > didn't see this in runs with many short client connections). I'm not > >> > > sure why the fix for that bug isn't in the trunk code. > >> > > > >> > > In any event, I'll set up a client submission loop and fix all these > >> > > things. > >> > > > >> > > Mihael > >> > > > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > >> > > > Ok, here it is with the additional debug messages. Source code > >> change is > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > >> > > > > >> > > > Warning: the tarball will expand to several gigabytes of logs. > >> > > > > >> > > > I had to do multiple client runs to trigger it. It seems like the > >> > > problem > >> > > > might be triggered by abnormal termination of the client. First 18 > >> runs > >> > > > went fine, problem only started when I ctrl-c-ed the swift/t run #19 > >> > > before > >> > > > the run #20 that exhibited delays. > >> > > > > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > >> > > > > >> > > > - Tim > >> > > > > >> > > > > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > >> tim.g.armstrong at gmail.com > >> > > > > >> > > > wrote: > >> > > > > >> > > > > It's here: > >> > > > > > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . > >> > > > > > >> > > > > I'll add some extra debug messages in the coaster C++ client and > >> see > >> > > if I > >> > > > > can recreate the scenario. 
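The remote-id change quoted just above boils down to keeping the service's unique internal job id separate from the id the client submitted with, and always notifying the client under its own id. A sketch of that bookkeeping, with names invented here rather than taken from the real coaster handler classes:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of the id mapping described above: the service needs a unique
    // internal id (two clients may reuse the same job id), but status
    // notifications go back out under the client's own id.
    public class JobIdMap {
        private final AtomicLong nextInternalId = new AtomicLong();
        // internal id -> {clientId, client's own job id}
        private final Map<Long, String[]> internalToClient = new ConcurrentHashMap<>();

        public long register(String clientId, String clientJobId) {
            long internal = nextInternalId.incrementAndGet();
            internalToClient.put(internal, new String[] {clientId, clientJobId});
            return internal;
        }

        // On a state change, look up which client to notify and use the id
        // that client originally supplied, not the internal one.
        public void notifyStatus(long internalId, String status) {
            String[] entry = internalToClient.get(internalId);
            if (entry != null) {
                System.out.printf("notify client %s: job %s -> %s%n", entry[0], entry[1], status);
            }
        }
    }

With this arrangement a status notification that races ahead of the submit reply is still meaningful to the client, since it is keyed by an id the client already knows.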
> >> > > > > > >> > > > > - Tim > >> > > > > > >> > > > > > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > >> hategan at mcs.anl.gov> > >> > > > > wrote: > >> > > > > > >> > > > >> Ok, so that's legit. > >> > > > >> > >> > > > >> It does look like shut down workers are not properly accounted > >> for in > >> > > > >> some places (and I believe Yadu submitted a bug for this). > >> However, I > >> > > do > >> > > > >> not see the dead time you mention in either of the last two sets > >> of > >> > > > >> logs. It looks like each client instance submits a continous > >> stream of > >> > > > >> jobs. > >> > > > >> > >> > > > >> So let's get back to the initial log. Can I have the full > >> service log? > >> > > > >> I'm trying to track what happened with the jobs submitted before > >> the > >> > > > >> first big pause. > >> > > > >> > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or > >> friends) > >> > > > >> would probably help a lot here. > >> > > > >> > >> > > > >> Mihael > >> > > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > >> > > > >> > Should be here: > >> > > > >> > > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > >> hategan at mcs.anl.gov > >> > > > > >> > > > >> wrote: > >> > > > >> > > >> > > > >> > > The first worker "failing" is 0904-20022331. The log looks > >> funny > >> > > at > >> > > > >> the > >> > > > >> > > end. > >> > > > >> > > > >> > > > >> > > Can you git pull and re-run? The worker is getting some > >> command > >> > > at the > >> > > > >> > > end there and doing nothing about it and I wonder why. > >> > > > >> > > > >> > > > >> > > Mihael > >> > > > >> > > > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > >> > > > >> > > > Ok, now I have some worker logs: > >> > > > >> > > > > >> > > > >> > > > > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > >> > > > >> > > > > >> > > > >> > > > There's nothing obvious I see in the worker logs that would > >> > > > >> indicate why > >> > > > >> > > > the connection was broken. > >> > > > >> > > > > >> > > > >> > > > - Tim > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > >> > > > >> tim.g.armstrong at gmail.com > >> > > > >> > > > > >> > > > >> > > > wrote: > >> > > > >> > > > > >> > > > >> > > > > This is all running locally on my laptop, so I think we > >> can > >> > > rule > >> > > > >> out > >> > > > >> > > 1). > >> > > > >> > > > > > >> > > > >> > > > > It also seems like it's a state the coaster service gets > >> into > >> > > > >> after a > >> > > > >> > > few > >> > > > >> > > > > client sessions: generally the first coaster run works > >> fine, > >> > > then > >> > > > >> > > after a > >> > > > >> > > > > few runs the problem occurs more frequently. > >> > > > >> > > > > > >> > > > >> > > > > I'm going to try and get worker logs, in the meantime > >> i've got > >> > > > >> some > >> > > > >> > > > > jstacks (attached). 
> >> > > > >> > > > > > >> > > > >> > > > > Matching service logs (largish) are here if needed: > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > >> > > > >> hategan at mcs.anl.gov> > >> > > > >> > > > > wrote: > >> > > > >> > > > > > >> > > > >> > > > >> Ah, makes sense. > >> > > > >> > > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live connection > >> is > >> > > > >> guaranteed > >> > > > >> > > to > >> > > > >> > > > >> have some communication for any 2 minute time window, > >> > > partially > >> > > > >> due to > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no > >> packets flow > >> > > > >> for the > >> > > > >> > > > >> duration of 2 minutes, the connection is assumed broken > >> and > >> > > all > >> > > > >> jobs > >> > > > >> > > > >> that were submitted to the respective workers are > >> considered > >> > > > >> failed. > >> > > > >> > > So > >> > > > >> > > > >> there seems to be an issue with the connections to some > >> of > >> > > the > >> > > > >> > > workers, > >> > > > >> > > > >> and it takes 2 minutes to detect them. > >> > > > >> > > > >> > >> > > > >> > > > >> Since the service seems to be alive (although a jstack > >> on the > >> > > > >> service > >> > > > >> > > > >> when thing seem to hang might help), this leaves two > >> > > > >> possibilities: > >> > > > >> > > > >> 1 - some genuine network problem > >> > > > >> > > > >> 2 - the worker died without properly closing TCP > >> connections > >> > > > >> > > > >> > >> > > > >> > > > >> If (2), you could enable worker logging > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see > >> if > >> > > > >> anything > >> > > > >> > > shows > >> > > > >> > > > >> up. > >> > > > >> > > > >> > >> > > > >> > > > >> Mihael > >> > > > >> > > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > >> > > > >> > > > >> > Here are client and service logs, with part of > >> service log > >> > > > >> edited > >> > > > >> > > down > >> > > > >> > > > >> to > >> > > > >> > > > >> > be a reasonable size (I have the full thing if > >> needed, but > >> > > it > >> > > > >> was > >> > > > >> > > over a > >> > > > >> > > > >> > gigabyte). > >> > > > >> > > > >> > > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. The > >> client > >> > > > >> submits 4 > >> > > > >> > > > >> jobs > >> > > > >> > > > >> > (its limit), but they don't complete until 19:51:32 > >> or so > >> > > (I > >> > > > >> can see > >> > > > >> > > > >> that > >> > > > >> > > > >> > one task completed based on ncompleted=1 in the > >> > > check_tasks log > >> > > > >> > > > >> message). > >> > > > >> > > > >> > It looks like something has happened with broken > >> pipes and > >> > > > >> workers > >> > > > >> > > being > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of > >> that is > >> > > > >> likely to > >> > > > >> > > be. > >> > > > >> > > > >> > > >> > > > >> > > > >> > - Tim > >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > >> > > > >> hategan at mcs.anl.gov > >> > > > >> > > > > >> > > > >> > > > >> wrote: > >> > > > >> > > > >> > > >> > > > >> > > > >> > > Hi Tim, > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > I've never seen this before with pure Java. 
> >> > > > >> > > > >> > > > >> > > > >> > > > >> > > Do you have logs from these runs? > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > Mihael > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong > >> wrote: > >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit > >> tasks to > >> > > > >> Coasters > >> > > > >> > > > >> through > >> > > > >> > > > >> > > the > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour > >> where task > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 minute > >> periods. > >> > > > >> For > >> > > > >> > > > >> example, I'm > >> > > > >> > > > >> > > > seeing submit log messages like "submitting > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > >> /bin/hostname" in > >> > > > >> bursts of > >> > > > >> > > > >> several > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in > >> between, > >> > > e.g. > >> > > > >> I'm > >> > > > >> > > seeing > >> > > > >> > > > >> > > bursts > >> > > > >> > > > >> > > > with the following intervals in my logs. > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > From what I can tell, the delay is on the coaster > >> > > service > >> > > > >> side: > >> > > > >> > > the > >> > > > >> > > > >> C > >> > > > >> > > > >> > > > client is just waiting for a response. > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > The jobs are just being submitted through the > >> local job > >> > > > >> > > manager, so > >> > > > >> > > > >> I > >> > > > >> > > > >> > > > wouldn't expect any delays there. The tasks are > >> also > >> > > just > >> > > > >> > > > >> > > "/bin/hostname", > >> > > > >> > > > >> > > > so should return immediately. > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > I'm going to continue digging into this on my > >> own, but > >> > > the > >> > > > >> 2 > >> > > > >> > > minute > >> > > > >> > > > >> delay > >> > > > >> > > > >> > > > seems like a big clue: does anyone have an idea > >> what > >> > > could > >> > > > >> cause > >> > > > >> > > > >> stalls > >> > > > >> > > > >> > > in > >> > > > >> > > > >> > > > task submission of 2 minute duration? 
> >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > Cheers, > >> > > > >> > > > >> > > > Tim > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > >> > > > >> > >> > > > >> > >> > > > > > >> > > > >> > > > >> > > > >> > >> > >> > > From tim.g.armstrong at gmail.com Thu Sep 11 12:41:17 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 12:41:17 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410457173.25856.5.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> Message-ID: I thought I was running the latest trunk, I'll rebuild and see if I can reproduce the issue. - Tim On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan wrote: > The method "getMetaChannel()" has been removed. Where did you get the > code from? > > Mihael > > On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > I'm seeing failures when running Swift/T tests with > > start-coaster-service.sh. > > > > E.g. the turbine test coaster-exec-1. I can provide instructions for > > running the test if needed (roughly, you need to build Swift/T with > coaster > > support enabled, then make tests/coaster-exec-1.result in the turbine > > directory). The github swift-t release is up to date if you want to use > > that. > > > > Full log is attached, stack trace excerpt is below. > > > > - Tim > > > > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
> > id=0911-1112130 > > Using threaded sender for TCPChannel [type: server, contact: > 127.0.0.1:48242 > > ] > > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel Using > > threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242] > > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > > @id://null-nullS > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > at > > > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > at > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > at > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > at > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > at > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > at > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > provider=local > > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler provider=local > > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > > @id://null-nullS > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > at > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > at > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > at > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > at > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > at > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > at > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job > > description > > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > description > > at > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > at > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > at > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > at > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > at > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > channel: null at id://null-nullS > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > at > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > at > > > 
org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > ... 4 more > > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, > > SUBMITJOB) sending error: Could not deserialize job description > > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > description > > at > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > at > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > at > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > at > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > at > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > channel: null at id://null-nullS > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > at > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > at > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > ... 4 more > > > > > > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > tim.g.armstrong at gmail.com> > > wrote: > > > > > This all sounds great. > > > > > > Just to check that I've understood correctly, from the client's point > of > > > view: > > > * The per-client settings behave the same if -shared is not provided. > > > * Per-client settings are ignored if -shared is provided > > > > > > I had one question: > > > * Do automatically allocated workers work with per-client settings? I > > > understand there were some issues related to sharing workers between > > > clients. Was the solution to have separate worker pools, or is this > just > > > not supported? > > > > > > - Tim > > > > > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan > > > wrote: > > > > > >> So... > > >> > > >> There were bugs. Lots of bugs. > > >> I did some work over the weekend to fix some of these and clean up the > > >> coaster code. Here's a summary: > > >> > > >> - there was some stuff in the low level coaster code to deal with > > >> persisting coaster channels over multiple connections with various > > >> options, like periodic connections, client or server initiated > > >> connections, buffering of commands, etc. None of this was used by > Swift, > > >> and the code was pretty messy. I removed that. > > >> - there were some issues with multiple clients: > > >> * improper shutdown of relevant workers when a client disconnected > > >> * the worker task dispatcher was a singleton and had a reference to > > >> one block allocator, whereas multiple clients involved multiple > > >> allocators. > > >> - there were a bunch of locking issues in the C client that valgrind > > >> caught > > >> - the idea of remote job ids was a bit hard to work with. This remote > id > > >> was the job id that the service assigned to a job. This is necessary > > >> because two different clients can submit jobs with the same id. The > > >> remote id would be communicated to the client as the reply to the > submit > > >> request. However, it was entirely possible for a notification about > job > > >> status to be sent to the client before the submit reply was. 
Since > > >> notifications were sent using the remote-id, the client would have no > > >> idea what job the notifications belonged to. Now, the server might > need > > >> a unique job id, but there is no reason why it cannot use the client > id > > >> when communicating the status to a client. So that's there now. > > >> - the way the C client was working, its jobs ended up not going to the > > >> workers, but the local queue. The service settings now allow > specifying > > >> the provider/jobManager/url to be used to start blocks, and jobs are > > >> routed appropriately if they do not have the batch job flag set. > > >> > > >> I also added a shared service mode. We discussed this before. > Basically > > >> you start the coaster service with "-shared " and > > >> all the settings are read from that file. In this case, all clients > > >> share the same worker pool, and client settings are ignored. > > >> > > >> The C client now has a multi-job testing tool which can submit many > jobs > > >> with the desired level of concurrency. > > >> > > >> I have tested the C client with both shared and non-shared mode, with > > >> various levels of jobs being sent, with either one or two concurrent > > >> clients. > > >> > > >> I haven't tested manual workers. > > >> > > >> I've also decided that during normal operation (i.e. client connects, > > >> submits jobs, shuts down gracefully), there should be no exceptions in > > >> the coaster log. I think we should stick to that principle. This was > the > > >> case last I tested, and we should consider any deviation from that to > be > > >> a problem. Of course, there are some things for which there is no > > >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are > > >> fine in that case. > > >> > > >> So anyway, let's start from here. > > >> > > >> Mihael > > >> > > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > >> > Thanks, let me know if there's anything I can help do. > > >> > > > >> > - Tim > > >> > > > >> > > > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > hategan at mcs.anl.gov> > > >> wrote: > > >> > > > >> > > Thanks. It also seems that there is an older bug in there in > which the > > >> > > client connection is not properly accounted for and things start > > >> failing > > >> > > two minutes after the client connects (which is also probably why > you > > >> > > didn't see this in runs with many short client connections). I'm > not > > >> > > sure why the fix for that bug isn't in the trunk code. > > >> > > > > >> > > In any event, I'll set up a client submission loop and fix all > these > > >> > > things. > > >> > > > > >> > > Mihael > > >> > > > > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > >> > > > Ok, here it is with the additional debug messages. Source code > > >> change is > > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > >> > > > > > >> > > > Warning: the tarball will expand to several gigabytes of logs. > > >> > > > > > >> > > > I had to do multiple client runs to trigger it. It seems like > the > > >> > > problem > > >> > > > might be triggered by abnormal termination of the client. > First 18 > > >> runs > > >> > > > went fine, problem only started when I ctrl-c-ed the swift/t > run #19 > > >> > > before > > >> > > > the run #20 that exhibited delays. 
> > >> > > > > > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > >> > > > > > >> > > > - Tim > > >> > > > > > >> > > > > > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > >> tim.g.armstrong at gmail.com > > >> > > > > > >> > > > wrote: > > >> > > > > > >> > > > > It's here: > > >> > > > > > > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > . > > >> > > > > > > >> > > > > I'll add some extra debug messages in the coaster C++ client > and > > >> see > > >> > > if I > > >> > > > > can recreate the scenario. > > >> > > > > > > >> > > > > - Tim > > >> > > > > > > >> > > > > > > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > >> hategan at mcs.anl.gov> > > >> > > > > wrote: > > >> > > > > > > >> > > > >> Ok, so that's legit. > > >> > > > >> > > >> > > > >> It does look like shut down workers are not properly > accounted > > >> for in > > >> > > > >> some places (and I believe Yadu submitted a bug for this). > > >> However, I > > >> > > do > > >> > > > >> not see the dead time you mention in either of the last two > sets > > >> of > > >> > > > >> logs. It looks like each client instance submits a continous > > >> stream of > > >> > > > >> jobs. > > >> > > > >> > > >> > > > >> So let's get back to the initial log. Can I have the full > > >> service log? > > >> > > > >> I'm trying to track what happened with the jobs submitted > before > > >> the > > >> > > > >> first big pause. > > >> > > > >> > > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or > > >> friends) > > >> > > > >> would probably help a lot here. > > >> > > > >> > > >> > > > >> Mihael > > >> > > > >> > > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > >> > > > >> > Should be here: > > >> > > > >> > > > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > > >> hategan at mcs.anl.gov > > >> > > > > > >> > > > >> wrote: > > >> > > > >> > > > >> > > > >> > > The first worker "failing" is 0904-20022331. The log > looks > > >> funny > > >> > > at > > >> > > > >> the > > >> > > > >> > > end. > > >> > > > >> > > > > >> > > > >> > > Can you git pull and re-run? The worker is getting some > > >> command > > >> > > at the > > >> > > > >> > > end there and doing nothing about it and I wonder why. > > >> > > > >> > > > > >> > > > >> > > Mihael > > >> > > > >> > > > > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > >> > > > >> > > > Ok, now I have some worker logs: > > >> > > > >> > > > > > >> > > > >> > > > > > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > >> > > > >> > > > > > >> > > > >> > > > There's nothing obvious I see in the worker logs that > would > > >> > > > >> indicate why > > >> > > > >> > > > the connection was broken. > > >> > > > >> > > > > > >> > > > >> > > > - Tim > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > >> > > > >> tim.g.armstrong at gmail.com > > >> > > > >> > > > > > >> > > > >> > > > wrote: > > >> > > > >> > > > > > >> > > > >> > > > > This is all running locally on my laptop, so I think > we > > >> can > > >> > > rule > > >> > > > >> out > > >> > > > >> > > 1). 
> > >> > > > >> > > > > > > >> > > > >> > > > > It also seems like it's a state the coaster service > gets > > >> into > > >> > > > >> after a > > >> > > > >> > > few > > >> > > > >> > > > > client sessions: generally the first coaster run > works > > >> fine, > > >> > > then > > >> > > > >> > > after a > > >> > > > >> > > > > few runs the problem occurs more frequently. > > >> > > > >> > > > > > > >> > > > >> > > > > I'm going to try and get worker logs, in the meantime > > >> i've got > > >> > > > >> some > > >> > > > >> > > > > jstacks (attached). > > >> > > > >> > > > > > > >> > > > >> > > > > Matching service logs (largish) are here if needed: > > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > > >> > > > >> hategan at mcs.anl.gov> > > >> > > > >> > > > > wrote: > > >> > > > >> > > > > > > >> > > > >> > > > >> Ah, makes sense. > > >> > > > >> > > > >> > > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live > connection > > >> is > > >> > > > >> guaranteed > > >> > > > >> > > to > > >> > > > >> > > > >> have some communication for any 2 minute time > window, > > >> > > partially > > >> > > > >> due to > > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no > > >> packets flow > > >> > > > >> for the > > >> > > > >> > > > >> duration of 2 minutes, the connection is assumed > broken > > >> and > > >> > > all > > >> > > > >> jobs > > >> > > > >> > > > >> that were submitted to the respective workers are > > >> considered > > >> > > > >> failed. > > >> > > > >> > > So > > >> > > > >> > > > >> there seems to be an issue with the connections to > some > > >> of > > >> > > the > > >> > > > >> > > workers, > > >> > > > >> > > > >> and it takes 2 minutes to detect them. > > >> > > > >> > > > >> > > >> > > > >> > > > >> Since the service seems to be alive (although a > jstack > > >> on the > > >> > > > >> service > > >> > > > >> > > > >> when thing seem to hang might help), this leaves two > > >> > > > >> possibilities: > > >> > > > >> > > > >> 1 - some genuine network problem > > >> > > > >> > > > >> 2 - the worker died without properly closing TCP > > >> connections > > >> > > > >> > > > >> > > >> > > > >> > > > >> If (2), you could enable worker logging > > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to > see > > >> if > > >> > > > >> anything > > >> > > > >> > > shows > > >> > > > >> > > > >> up. > > >> > > > >> > > > >> > > >> > > > >> > > > >> Mihael > > >> > > > >> > > > >> > > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong > wrote: > > >> > > > >> > > > >> > Here are client and service logs, with part of > > >> service log > > >> > > > >> edited > > >> > > > >> > > down > > >> > > > >> > > > >> to > > >> > > > >> > > > >> > be a reasonable size (I have the full thing if > > >> needed, but > > >> > > it > > >> > > > >> was > > >> > > > >> > > over a > > >> > > > >> > > > >> > gigabyte). > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. > The > > >> client > > >> > > > >> submits 4 > > >> > > > >> > > > >> jobs > > >> > > > >> > > > >> > (its limit), but they don't complete until > 19:51:32 > > >> or so > > >> > > (I > > >> > > > >> can see > > >> > > > >> > > > >> that > > >> > > > >> > > > >> > one task completed based on ncompleted=1 in the > > >> > > check_tasks log > > >> > > > >> > > > >> message). 
> > >> > > > >> > > > >> > It looks like something has happened with broken > > >> pipes and > > >> > > > >> workers > > >> > > > >> > > being > > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of > > >> that is > > >> > > > >> likely to > > >> > > > >> > > be. > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > - Tim > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > > >> > > > >> hategan at mcs.anl.gov > > >> > > > >> > > > > > >> > > > >> > > > >> wrote: > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > Hi Tim, > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > I've never seen this before with pure Java. > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > Do you have logs from these runs? > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > Mihael > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong > > >> wrote: > > >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit > > >> tasks to > > >> > > > >> Coasters > > >> > > > >> > > > >> through > > >> > > > >> > > > >> > > the > > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour > > >> where task > > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 minute > > >> periods. > > >> > > > >> For > > >> > > > >> > > > >> example, I'm > > >> > > > >> > > > >> > > > seeing submit log messages like "submitting > > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > > >> /bin/hostname" in > > >> > > > >> bursts of > > >> > > > >> > > > >> several > > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in > > >> between, > > >> > > e.g. > > >> > > > >> I'm > > >> > > > >> > > seeing > > >> > > > >> > > > >> > > bursts > > >> > > > >> > > > >> > > > with the following intervals in my logs. > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > From what I can tell, the delay is on the > coaster > > >> > > service > > >> > > > >> side: > > >> > > > >> > > the > > >> > > > >> > > > >> C > > >> > > > >> > > > >> > > > client is just waiting for a response. > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > The jobs are just being submitted through the > > >> local job > > >> > > > >> > > manager, so > > >> > > > >> > > > >> I > > >> > > > >> > > > >> > > > wouldn't expect any delays there. The tasks > are > > >> also > > >> > > just > > >> > > > >> > > > >> > > "/bin/hostname", > > >> > > > >> > > > >> > > > so should return immediately. > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > I'm going to continue digging into this on my > > >> own, but > > >> > > the > > >> > > > >> 2 > > >> > > > >> > > minute > > >> > > > >> > > > >> delay > > >> > > > >> > > > >> > > > seems like a big clue: does anyone have an > idea > > >> what > > >> > > could > > >> > > > >> cause > > >> > > > >> > > > >> stalls > > >> > > > >> > > > >> > > in > > >> > > > >> > > > >> > > > task submission of 2 minute duration? 
> > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > Cheers, > > >> > > > >> > > > >> > > > Tim > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > > > > >> > > > > >> > > > > >> > > > > >> > > >> > > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Sep 11 12:54:46 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 10:54:46 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> Message-ID: <1410458086.26144.0.camel@echo> I thought we switched to git. Mihael On Thu, 2014-09-11 at 12:41 -0500, Tim Armstrong wrote: > I thought I was running the latest trunk, I'll rebuild and see if I can > reproduce the issue. > > - Tim > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan > wrote: > > > The method "getMetaChannel()" has been removed. Where did you get the > > code from? > > > > Mihael > > > > On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > > I'm seeing failures when running Swift/T tests with > > > start-coaster-service.sh. > > > > > > E.g. the turbine test coaster-exec-1. I can provide instructions for > > > running the test if needed (roughly, you need to build Swift/T with > > coaster > > > support enabled, then make tests/coaster-exec-1.result in the turbine > > > directory). The github swift-t release is up to date if you want to use > > > that. > > > > > > Full log is attached, stack trace excerpt is below. > > > > > > - Tim > > > > > > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
> > > id=0911-1112130 > > > Using threaded sender for TCPChannel [type: server, contact: > > 127.0.0.1:48242 > > > ] > > > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel Using > > > threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242] > > > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > > > @id://null-nullS > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > > at > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > at > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > at > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > at > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > provider=local > > > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler provider=local > > > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > > > @id://null-nullS > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > at > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > at > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > at > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > at > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job > > > description > > > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > > description > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > at > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > at > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > at > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > at > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > > channel: null at id://null-nullS > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > at > > > > > 
org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > ... 4 more > > > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, > > > SUBMITJOB) sending error: Could not deserialize job description > > > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > > description > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > at > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > at > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > at > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > at > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > > channel: null at id://null-nullS > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > ... 4 more > > > > > > > > > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > > tim.g.armstrong at gmail.com> > > > wrote: > > > > > > > This all sounds great. > > > > > > > > Just to check that I've understood correctly, from the client's point > > of > > > > view: > > > > * The per-client settings behave the same if -shared is not provided. > > > > * Per-client settings are ignored if -shared is provided > > > > > > > > I had one question: > > > > * Do automatically allocated workers work with per-client settings? I > > > > understand there were some issues related to sharing workers between > > > > clients. Was the solution to have separate worker pools, or is this > > just > > > > not supported? > > > > > > > > - Tim > > > > > > > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan > > > > wrote: > > > > > > > >> So... > > > >> > > > >> There were bugs. Lots of bugs. > > > >> I did some work over the weekend to fix some of these and clean up the > > > >> coaster code. Here's a summary: > > > >> > > > >> - there was some stuff in the low level coaster code to deal with > > > >> persisting coaster channels over multiple connections with various > > > >> options, like periodic connections, client or server initiated > > > >> connections, buffering of commands, etc. None of this was used by > > Swift, > > > >> and the code was pretty messy. I removed that. > > > >> - there were some issues with multiple clients: > > > >> * improper shutdown of relevant workers when a client disconnected > > > >> * the worker task dispatcher was a singleton and had a reference to > > > >> one block allocator, whereas multiple clients involved multiple > > > >> allocators. > > > >> - there were a bunch of locking issues in the C client that valgrind > > > >> caught > > > >> - the idea of remote job ids was a bit hard to work with. 
This remote > > id > > > >> was the job id that the service assigned to a job. This is necessary > > > >> because two different clients can submit jobs with the same id. The > > > >> remote id would be communicated to the client as the reply to the > > submit > > > >> request. However, it was entirely possible for a notification about > > job > > > >> status to be sent to the client before the submit reply was. Since > > > >> notifications were sent using the remote-id, the client would have no > > > >> idea what job the notifications belonged to. Now, the server might > > need > > > >> a unique job id, but there is no reason why it cannot use the client > > id > > > >> when communicating the status to a client. So that's there now. > > > >> - the way the C client was working, its jobs ended up not going to the > > > >> workers, but the local queue. The service settings now allow > > specifying > > > >> the provider/jobManager/url to be used to start blocks, and jobs are > > > >> routed appropriately if they do not have the batch job flag set. > > > >> > > > >> I also added a shared service mode. We discussed this before. > > Basically > > > >> you start the coaster service with "-shared " and > > > >> all the settings are read from that file. In this case, all clients > > > >> share the same worker pool, and client settings are ignored. > > > >> > > > >> The C client now has a multi-job testing tool which can submit many > > jobs > > > >> with the desired level of concurrency. > > > >> > > > >> I have tested the C client with both shared and non-shared mode, with > > > >> various levels of jobs being sent, with either one or two concurrent > > > >> clients. > > > >> > > > >> I haven't tested manual workers. > > > >> > > > >> I've also decided that during normal operation (i.e. client connects, > > > >> submits jobs, shuts down gracefully), there should be no exceptions in > > > >> the coaster log. I think we should stick to that principle. This was > > the > > > >> case last I tested, and we should consider any deviation from that to > > be > > > >> a problem. Of course, there are some things for which there is no > > > >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are > > > >> fine in that case. > > > >> > > > >> So anyway, let's start from here. > > > >> > > > >> Mihael > > > >> > > > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > > >> > Thanks, let me know if there's anything I can help do. > > > >> > > > > >> > - Tim > > > >> > > > > >> > > > > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > > hategan at mcs.anl.gov> > > > >> wrote: > > > >> > > > > >> > > Thanks. It also seems that there is an older bug in there in > > which the > > > >> > > client connection is not properly accounted for and things start > > > >> failing > > > >> > > two minutes after the client connects (which is also probably why > > you > > > >> > > didn't see this in runs with many short client connections). I'm > > not > > > >> > > sure why the fix for that bug isn't in the trunk code. > > > >> > > > > > >> > > In any event, I'll set up a client submission loop and fix all > > these > > > >> > > things. > > > >> > > > > > >> > > Mihael > > > >> > > > > > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > > >> > > > Ok, here it is with the additional debug messages. Source code > > > >> change is > > > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. 
> > > >> > > > > > > >> > > > Warning: the tarball will expand to several gigabytes of logs. > > > >> > > > > > > >> > > > I had to do multiple client runs to trigger it. It seems like > > the > > > >> > > problem > > > >> > > > might be triggered by abnormal termination of the client. > > First 18 > > > >> runs > > > >> > > > went fine, problem only started when I ctrl-c-ed the swift/t > > run #19 > > > >> > > before > > > >> > > > the run #20 that exhibited delays. > > > >> > > > > > > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > >> > > > > > > >> > > > - Tim > > > >> > > > > > > >> > > > > > > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > > >> tim.g.armstrong at gmail.com > > > >> > > > > > > >> > > > wrote: > > > >> > > > > > > >> > > > > It's here: > > > >> > > > > > > > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > > . > > > >> > > > > > > > >> > > > > I'll add some extra debug messages in the coaster C++ client > > and > > > >> see > > > >> > > if I > > > >> > > > > can recreate the scenario. > > > >> > > > > > > > >> > > > > - Tim > > > >> > > > > > > > >> > > > > > > > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > > >> hategan at mcs.anl.gov> > > > >> > > > > wrote: > > > >> > > > > > > > >> > > > >> Ok, so that's legit. > > > >> > > > >> > > > >> > > > >> It does look like shut down workers are not properly > > accounted > > > >> for in > > > >> > > > >> some places (and I believe Yadu submitted a bug for this). > > > >> However, I > > > >> > > do > > > >> > > > >> not see the dead time you mention in either of the last two > > sets > > > >> of > > > >> > > > >> logs. It looks like each client instance submits a continous > > > >> stream of > > > >> > > > >> jobs. > > > >> > > > >> > > > >> > > > >> So let's get back to the initial log. Can I have the full > > > >> service log? > > > >> > > > >> I'm trying to track what happened with the jobs submitted > > before > > > >> the > > > >> > > > >> first big pause. > > > >> > > > >> > > > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or > > > >> friends) > > > >> > > > >> would probably help a lot here. > > > >> > > > >> > > > >> > > > >> Mihael > > > >> > > > >> > > > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > > >> > > > >> > Should be here: > > > >> > > > >> > > > > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > > > >> hategan at mcs.anl.gov > > > >> > > > > > > >> > > > >> wrote: > > > >> > > > >> > > > > >> > > > >> > > The first worker "failing" is 0904-20022331. The log > > looks > > > >> funny > > > >> > > at > > > >> > > > >> the > > > >> > > > >> > > end. > > > >> > > > >> > > > > > >> > > > >> > > Can you git pull and re-run? The worker is getting some > > > >> command > > > >> > > at the > > > >> > > > >> > > end there and doing nothing about it and I wonder why. 
> > > >> > > > >> > > > > > >> > > > >> > > Mihael > > > >> > > > >> > > > > > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > > >> > > > >> > > > Ok, now I have some worker logs: > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > >> > > > >> > > > > > > >> > > > >> > > > There's nothing obvious I see in the worker logs that > > would > > > >> > > > >> indicate why > > > >> > > > >> > > > the connection was broken. > > > >> > > > >> > > > > > > >> > > > >> > > > - Tim > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > > >> > > > >> tim.g.armstrong at gmail.com > > > >> > > > >> > > > > > > >> > > > >> > > > wrote: > > > >> > > > >> > > > > > > >> > > > >> > > > > This is all running locally on my laptop, so I think > > we > > > >> can > > > >> > > rule > > > >> > > > >> out > > > >> > > > >> > > 1). > > > >> > > > >> > > > > > > > >> > > > >> > > > > It also seems like it's a state the coaster service > > gets > > > >> into > > > >> > > > >> after a > > > >> > > > >> > > few > > > >> > > > >> > > > > client sessions: generally the first coaster run > > works > > > >> fine, > > > >> > > then > > > >> > > > >> > > after a > > > >> > > > >> > > > > few runs the problem occurs more frequently. > > > >> > > > >> > > > > > > > >> > > > >> > > > > I'm going to try and get worker logs, in the meantime > > > >> i've got > > > >> > > > >> some > > > >> > > > >> > > > > jstacks (attached). > > > >> > > > >> > > > > > > > >> > > > >> > > > > Matching service logs (largish) are here if needed: > > > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > >> > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > > > >> > > > >> hategan at mcs.anl.gov> > > > >> > > > >> > > > > wrote: > > > >> > > > >> > > > > > > > >> > > > >> > > > >> Ah, makes sense. > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live > > connection > > > >> is > > > >> > > > >> guaranteed > > > >> > > > >> > > to > > > >> > > > >> > > > >> have some communication for any 2 minute time > > window, > > > >> > > partially > > > >> > > > >> due to > > > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no > > > >> packets flow > > > >> > > > >> for the > > > >> > > > >> > > > >> duration of 2 minutes, the connection is assumed > > broken > > > >> and > > > >> > > all > > > >> > > > >> jobs > > > >> > > > >> > > > >> that were submitted to the respective workers are > > > >> considered > > > >> > > > >> failed. > > > >> > > > >> > > So > > > >> > > > >> > > > >> there seems to be an issue with the connections to > > some > > > >> of > > > >> > > the > > > >> > > > >> > > workers, > > > >> > > > >> > > > >> and it takes 2 minutes to detect them. 
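To make the timeout arithmetic described above concrete: a connection is only declared dead after a full two-minute quiet window, even though heartbeats go out every minute. Below is a minimal sketch of that kind of bookkeeping in C++ with invented names; the real logic lives in the Java service and worker.pl, so this only illustrates the rule being described, not the actual coaster code:

    #include <chrono>
    #include <map>
    #include <string>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    // Hypothetical per-connection record; the actual service keeps the
    // equivalent state inside its channel objects.
    struct ConnectionState {
        Clock::time_point lastTraffic;  // refreshed on every packet, heartbeats included
    };

    constexpr auto HEARTBEAT_INTERVAL = std::chrono::seconds(60);   // "sent every 1 minute"
    constexpr auto CHANNEL_TIMEOUT    = std::chrono::seconds(120);  // "2 minute time window"

    // Run periodically: any connection with no traffic for the full timeout
    // window is reported as broken, and jobs sent to its workers are failed.
    std::vector<std::string> findDeadConnections(
            const std::map<std::string, ConnectionState>& connections) {
        std::vector<std::string> dead;
        const auto now = Clock::now();
        for (const auto& [id, state] : connections) {
            if (now - state.lastTraffic > CHANNEL_TIMEOUT) {
                dead.push_back(id);
            }
        }
        return dead;
    }

A worker that dies without closing its socket therefore only surfaces after the full two-minute window has elapsed, which is consistent with the roughly two-minute gaps reported at the start of the thread.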
> > > >> > > > >> > > > >> > > > >> > > > >> > > > >> Since the service seems to be alive (although a > > jstack > > > >> on the > > > >> > > > >> service > > > >> > > > >> > > > >> when thing seem to hang might help), this leaves two > > > >> > > > >> possibilities: > > > >> > > > >> > > > >> 1 - some genuine network problem > > > >> > > > >> > > > >> 2 - the worker died without properly closing TCP > > > >> connections > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> If (2), you could enable worker logging > > > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to > > see > > > >> if > > > >> > > > >> anything > > > >> > > > >> > > shows > > > >> > > > >> > > > >> up. > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> Mihael > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong > > wrote: > > > >> > > > >> > > > >> > Here are client and service logs, with part of > > > >> service log > > > >> > > > >> edited > > > >> > > > >> > > down > > > >> > > > >> > > > >> to > > > >> > > > >> > > > >> > be a reasonable size (I have the full thing if > > > >> needed, but > > > >> > > it > > > >> > > > >> was > > > >> > > > >> > > over a > > > >> > > > >> > > > >> > gigabyte). > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. > > The > > > >> client > > > >> > > > >> submits 4 > > > >> > > > >> > > > >> jobs > > > >> > > > >> > > > >> > (its limit), but they don't complete until > > 19:51:32 > > > >> or so > > > >> > > (I > > > >> > > > >> can see > > > >> > > > >> > > > >> that > > > >> > > > >> > > > >> > one task completed based on ncompleted=1 in the > > > >> > > check_tasks log > > > >> > > > >> > > > >> message). > > > >> > > > >> > > > >> > It looks like something has happened with broken > > > >> pipes and > > > >> > > > >> workers > > > >> > > > >> > > being > > > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of > > > >> that is > > > >> > > > >> likely to > > > >> > > > >> > > be. > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > - Tim > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > > > >> > > > >> hategan at mcs.anl.gov > > > >> > > > >> > > > > > > >> > > > >> > > > >> wrote: > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > Hi Tim, > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > I've never seen this before with pure Java. > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > Do you have logs from these runs? > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > Mihael > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong > > > >> wrote: > > > >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit > > > >> tasks to > > > >> > > > >> Coasters > > > >> > > > >> > > > >> through > > > >> > > > >> > > > >> > > the > > > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour > > > >> where task > > > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 minute > > > >> periods. 
> > > >> > > > >> For > > > >> > > > >> > > > >> example, I'm > > > >> > > > >> > > > >> > > > seeing submit log messages like "submitting > > > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > > > >> /bin/hostname" in > > > >> > > > >> bursts of > > > >> > > > >> > > > >> several > > > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in > > > >> between, > > > >> > > e.g. > > > >> > > > >> I'm > > > >> > > > >> > > seeing > > > >> > > > >> > > > >> > > bursts > > > >> > > > >> > > > >> > > > with the following intervals in my logs. > > > >> > > > >> > > > >> > > > > > > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > >> > > > >> > > > >> > > > > > > >> > > > >> > > > >> > > > From what I can tell, the delay is on the > > coaster > > > >> > > service > > > >> > > > >> side: > > > >> > > > >> > > the > > > >> > > > >> > > > >> C > > > >> > > > >> > > > >> > > > client is just waiting for a response. > > > >> > > > >> > > > >> > > > > > > >> > > > >> > > > >> > > > The jobs are just being submitted through the > > > >> local job > > > >> > > > >> > > manager, so > > > >> > > > >> > > > >> I > > > >> > > > >> > > > >> > > > wouldn't expect any delays there. The tasks > > are > > > >> also > > > >> > > just > > > >> > > > >> > > > >> > > "/bin/hostname", > > > >> > > > >> > > > >> > > > so should return immediately. > > > >> > > > >> > > > >> > > > > > > >> > > > >> > > > >> > > > I'm going to continue digging into this on my > > > >> own, but > > > >> > > the > > > >> > > > >> 2 > > > >> > > > >> > > minute > > > >> > > > >> > > > >> delay > > > >> > > > >> > > > >> > > > seems like a big clue: does anyone have an > > idea > > > >> what > > > >> > > could > > > >> > > > >> cause > > > >> > > > >> > > > >> stalls > > > >> > > > >> > > > >> > > in > > > >> > > > >> > > > >> > > > task submission of 2 minute duration? > > > >> > > > >> > > > >> > > > > > > >> > > > >> > > > >> > > > Cheers, > > > >> > > > >> > > > >> > > > Tim > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > > > > > > > > From tim.g.armstrong at gmail.com Thu Sep 11 13:10:52 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 13:10:52 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> Message-ID: I meant the github master., but it turns out that I had had the wrong Swift on my path. Apologies for the confusion. I've rerun with the current one. I'm getting a null pointer exception on line 226 of BlockQueueProcessor.java. Adding some printfs revealed that settings was null. Log attached. 
- Tim Job: Job(id:0 600.000s) Settings: null java.lang.NullPointerException at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong wrote: > I thought I was running the latest trunk, I'll rebuild and see if I can > reproduce the issue. > > - Tim > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan > wrote: > >> The method "getMetaChannel()" has been removed. Where did you get the >> code from? >> >> Mihael >> >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: >> > I'm seeing failures when running Swift/T tests with >> > start-coaster-service.sh. >> > >> > E.g. the turbine test coaster-exec-1. I can provide instructions for >> > running the test if needed (roughly, you need to build Swift/T with >> coaster >> > support enabled, then make tests/coaster-exec-1.result in the turbine >> > directory). The github swift-t release is up to date if you want to use >> > that. >> > >> > Full log is attached, stack trace excerpt is below. >> > >> > - Tim >> > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
>> > id=0911-1112130 >> > Using threaded sender for TCPChannel [type: server, contact: >> 127.0.0.1:48242 >> > ] >> > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel Using >> > threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242] >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null >> > @id://null-nullS >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) >> > at >> > >> org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) >> > at >> > >> org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) >> > at >> > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) >> > at >> > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) >> > at >> > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) >> > at >> > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) >> > at >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) >> > provider=local >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler provider=local >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null >> > @id://null-nullS >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) >> > at >> > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) >> > at >> > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) >> > at >> > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) >> > at >> > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) >> > at >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job >> > description >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job >> > description >> > at >> > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) >> > at >> > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) >> > at >> > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) >> > at >> > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) >> > at >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid >> > channel: null at id://null-nullS >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) >> > at >> > >> 
org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) >> > at >> > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) >> > ... 4 more >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, >> > SUBMITJOB) sending error: Could not deserialize job description >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job >> > description >> > at >> > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) >> > at >> > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) >> > at >> > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) >> > at >> > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) >> > at >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid >> > channel: null at id://null-nullS >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) >> > at >> > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) >> > ... 4 more >> > >> > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < >> tim.g.armstrong at gmail.com> >> > wrote: >> > >> > > This all sounds great. >> > > >> > > Just to check that I've understood correctly, from the client's point >> of >> > > view: >> > > * The per-client settings behave the same if -shared is not provided. >> > > * Per-client settings are ignored if -shared is provided >> > > >> > > I had one question: >> > > * Do automatically allocated workers work with per-client settings? I >> > > understand there were some issues related to sharing workers between >> > > clients. Was the solution to have separate worker pools, or is this >> just >> > > not supported? >> > > >> > > - Tim >> > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan >> > > wrote: >> > > >> > >> So... >> > >> >> > >> There were bugs. Lots of bugs. >> > >> I did some work over the weekend to fix some of these and clean up >> the >> > >> coaster code. Here's a summary: >> > >> >> > >> - there was some stuff in the low level coaster code to deal with >> > >> persisting coaster channels over multiple connections with various >> > >> options, like periodic connections, client or server initiated >> > >> connections, buffering of commands, etc. None of this was used by >> Swift, >> > >> and the code was pretty messy. I removed that. >> > >> - there were some issues with multiple clients: >> > >> * improper shutdown of relevant workers when a client disconnected >> > >> * the worker task dispatcher was a singleton and had a reference to >> > >> one block allocator, whereas multiple clients involved multiple >> > >> allocators. >> > >> - there were a bunch of locking issues in the C client that valgrind >> > >> caught >> > >> - the idea of remote job ids was a bit hard to work with. This >> remote id >> > >> was the job id that the service assigned to a job. This is necessary >> > >> because two different clients can submit jobs with the same id. 
The >> > >> remote id would be communicated to the client as the reply to the >> submit >> > >> request. However, it was entirely possible for a notification about >> job >> > >> status to be sent to the client before the submit reply was. Since >> > >> notifications were sent using the remote-id, the client would have no >> > >> idea what job the notifications belonged to. Now, the server might >> need >> > >> a unique job id, but there is no reason why it cannot use the client >> id >> > >> when communicating the status to a client. So that's there now. >> > >> - the way the C client was working, its jobs ended up not going to >> the >> > >> workers, but the local queue. The service settings now allow >> specifying >> > >> the provider/jobManager/url to be used to start blocks, and jobs are >> > >> routed appropriately if they do not have the batch job flag set. >> > >> >> > >> I also added a shared service mode. We discussed this before. >> Basically >> > >> you start the coaster service with "-shared " and >> > >> all the settings are read from that file. In this case, all clients >> > >> share the same worker pool, and client settings are ignored. >> > >> >> > >> The C client now has a multi-job testing tool which can submit many >> jobs >> > >> with the desired level of concurrency. >> > >> >> > >> I have tested the C client with both shared and non-shared mode, with >> > >> various levels of jobs being sent, with either one or two concurrent >> > >> clients. >> > >> >> > >> I haven't tested manual workers. >> > >> >> > >> I've also decided that during normal operation (i.e. client connects, >> > >> submits jobs, shuts down gracefully), there should be no exceptions >> in >> > >> the coaster log. I think we should stick to that principle. This was >> the >> > >> case last I tested, and we should consider any deviation from that >> to be >> > >> a problem. Of course, there are some things for which there is no >> > >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions >> are >> > >> fine in that case. >> > >> >> > >> So anyway, let's start from here. >> > >> >> > >> Mihael >> > >> >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: >> > >> > Thanks, let me know if there's anything I can help do. >> > >> > >> > >> > - Tim >> > >> > >> > >> > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < >> hategan at mcs.anl.gov> >> > >> wrote: >> > >> > >> > >> > > Thanks. It also seems that there is an older bug in there in >> which the >> > >> > > client connection is not properly accounted for and things start >> > >> failing >> > >> > > two minutes after the client connects (which is also probably >> why you >> > >> > > didn't see this in runs with many short client connections). I'm >> not >> > >> > > sure why the fix for that bug isn't in the trunk code. >> > >> > > >> > >> > > In any event, I'll set up a client submission loop and fix all >> these >> > >> > > things. >> > >> > > >> > >> > > Mihael >> > >> > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: >> > >> > > > Ok, here it is with the additional debug messages. Source code >> > >> change is >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. >> > >> > > > >> > >> > > > Warning: the tarball will expand to several gigabytes of logs. >> > >> > > > >> > >> > > > I had to do multiple client runs to trigger it. It seems like >> the >> > >> > > problem >> > >> > > > might be triggered by abnormal termination of the client. 
>> First 18 >> > >> runs >> > >> > > > went fine, problem only started when I ctrl-c-ed the swift/t >> run #19 >> > >> > > before >> > >> > > > the run #20 that exhibited delays. >> > >> > > > >> > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz >> > >> > > > >> > >> > > > - Tim >> > >> > > > >> > >> > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < >> > >> tim.g.armstrong at gmail.com >> > >> > > > >> > >> > > > wrote: >> > >> > > > >> > >> > > > > It's here: >> > >> > > > > >> > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz >> . >> > >> > > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ client >> and >> > >> see >> > >> > > if I >> > >> > > > > can recreate the scenario. >> > >> > > > > >> > >> > > > > - Tim >> > >> > > > > >> > >> > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < >> > >> hategan at mcs.anl.gov> >> > >> > > > > wrote: >> > >> > > > > >> > >> > > > >> Ok, so that's legit. >> > >> > > > >> >> > >> > > > >> It does look like shut down workers are not properly >> accounted >> > >> for in >> > >> > > > >> some places (and I believe Yadu submitted a bug for this). >> > >> However, I >> > >> > > do >> > >> > > > >> not see the dead time you mention in either of the last two >> sets >> > >> of >> > >> > > > >> logs. It looks like each client instance submits a continous >> > >> stream of >> > >> > > > >> jobs. >> > >> > > > >> >> > >> > > > >> So let's get back to the initial log. Can I have the full >> > >> service log? >> > >> > > > >> I'm trying to track what happened with the jobs submitted >> before >> > >> the >> > >> > > > >> first big pause. >> > >> > > > >> >> > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or >> > >> friends) >> > >> > > > >> would probably help a lot here. >> > >> > > > >> >> > >> > > > >> Mihael >> > >> > > > >> >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: >> > >> > > > >> > Should be here: >> > >> > > > >> > >> > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz >> > >> > > > >> > >> > >> > > > >> > >> > >> > > > >> > >> > >> > > > >> > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < >> > >> hategan at mcs.anl.gov >> > >> > > > >> > >> > > > >> wrote: >> > >> > > > >> > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. The log >> looks >> > >> funny >> > >> > > at >> > >> > > > >> the >> > >> > > > >> > > end. >> > >> > > > >> > > >> > >> > > > >> > > Can you git pull and re-run? The worker is getting some >> > >> command >> > >> > > at the >> > >> > > > >> > > end there and doing nothing about it and I wonder why. >> > >> > > > >> > > >> > >> > > > >> > > Mihael >> > >> > > > >> > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: >> > >> > > > >> > > > Ok, now I have some worker logs: >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz >> > >> > > > >> > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker logs that >> would >> > >> > > > >> indicate why >> > >> > > > >> > > > the connection was broken. 
>> > >> > > > >> > > > >> > >> > > > >> > > > - Tim >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < >> > >> > > > >> tim.g.armstrong at gmail.com >> > >> > > > >> > > > >> > >> > > > >> > > > wrote: >> > >> > > > >> > > > >> > >> > > > >> > > > > This is all running locally on my laptop, so I >> think we >> > >> can >> > >> > > rule >> > >> > > > >> out >> > >> > > > >> > > 1). >> > >> > > > >> > > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster service >> gets >> > >> into >> > >> > > > >> after a >> > >> > > > >> > > few >> > >> > > > >> > > > > client sessions: generally the first coaster run >> works >> > >> fine, >> > >> > > then >> > >> > > > >> > > after a >> > >> > > > >> > > > > few runs the problem occurs more frequently. >> > >> > > > >> > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, in the >> meantime >> > >> i've got >> > >> > > > >> some >> > >> > > > >> > > > > jstacks (attached). >> > >> > > > >> > > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if needed: >> > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < >> > >> > > > >> hategan at mcs.anl.gov> >> > >> > > > >> > > > > wrote: >> > >> > > > >> > > > > >> > >> > > > >> > > > >> Ah, makes sense. >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live >> connection >> > >> is >> > >> > > > >> guaranteed >> > >> > > > >> > > to >> > >> > > > >> > > > >> have some communication for any 2 minute time >> window, >> > >> > > partially >> > >> > > > >> due to >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no >> > >> packets flow >> > >> > > > >> for the >> > >> > > > >> > > > >> duration of 2 minutes, the connection is assumed >> broken >> > >> and >> > >> > > all >> > >> > > > >> jobs >> > >> > > > >> > > > >> that were submitted to the respective workers are >> > >> considered >> > >> > > > >> failed. >> > >> > > > >> > > So >> > >> > > > >> > > > >> there seems to be an issue with the connections to >> some >> > >> of >> > >> > > the >> > >> > > > >> > > workers, >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> Since the service seems to be alive (although a >> jstack >> > >> on the >> > >> > > > >> service >> > >> > > > >> > > > >> when thing seem to hang might help), this leaves >> two >> > >> > > > >> possibilities: >> > >> > > > >> > > > >> 1 - some genuine network problem >> > >> > > > >> > > > >> 2 - the worker died without properly closing TCP >> > >> connections >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> If (2), you could enable worker logging >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to >> see >> > >> if >> > >> > > > >> anything >> > >> > > > >> > > shows >> > >> > > > >> > > > >> up. 
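For reference, turning that worker logging on from the C++ client side would look something like the sketch below. The key name is the one quoted above; the set() accessor and the header name are assumptions about the client API rather than anything confirmed in this thread, so check the actual Settings header before relying on it:

    #include "Settings.h"  // coaster C++ client settings header; exact path may differ

    // Sketch only: ask the service to start workers with DEBUG-level logging,
    // so each worker writes a log file that can be inspected after a failure.
    // Assumes Settings exposes a set(key, value) accessor.
    void enableWorkerLogging(Settings& settings) {
        settings.set(Settings::Key::WORKER_LOGGING_LEVEL, "DEBUG");
    }

Whatever the exact call, this is the mechanism behind the worker-logs tarballs exchanged elsewhere in the thread.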
>> > >> > > > >> > > > >> >> > >> > > > >> > > > >> Mihael >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong >> wrote: >> > >> > > > >> > > > >> > Here are client and service logs, with part of >> > >> service log >> > >> > > > >> edited >> > >> > > > >> > > down >> > >> > > > >> > > > >> to >> > >> > > > >> > > > >> > be a reasonable size (I have the full thing if >> > >> needed, but >> > >> > > it >> > >> > > > >> was >> > >> > > > >> > > over a >> > >> > > > >> > > > >> > gigabyte). >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. >> The >> > >> client >> > >> > > > >> submits 4 >> > >> > > > >> > > > >> jobs >> > >> > > > >> > > > >> > (its limit), but they don't complete until >> 19:51:32 >> > >> or so >> > >> > > (I >> > >> > > > >> can see >> > >> > > > >> > > > >> that >> > >> > > > >> > > > >> > one task completed based on ncompleted=1 in the >> > >> > > check_tasks log >> > >> > > > >> > > > >> message). >> > >> > > > >> > > > >> > It looks like something has happened with broken >> > >> pipes and >> > >> > > > >> workers >> > >> > > > >> > > being >> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of >> > >> that is >> > >> > > > >> likely to >> > >> > > > >> > > be. >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > - Tim >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < >> > >> > > > >> hategan at mcs.anl.gov >> > >> > > > >> > > > >> > >> > > > >> > > > >> wrote: >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > > Hi Tim, >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > I've never seen this before with pure Java. >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > Mihael >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim >> Armstrong >> > >> wrote: >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit >> > >> tasks to >> > >> > > > >> Coasters >> > >> > > > >> > > > >> through >> > >> > > > >> > > > >> > > the >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour >> > >> where task >> > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 >> minute >> > >> periods. >> > >> > > > >> For >> > >> > > > >> > > > >> example, I'm >> > >> > > > >> > > > >> > > > seeing submit log messages like "submitting >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: >> > >> /bin/hostname" in >> > >> > > > >> bursts of >> > >> > > > >> > > > >> several >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in >> > >> between, >> > >> > > e.g. >> > >> > > > >> I'm >> > >> > > > >> > > seeing >> > >> > > > >> > > > >> > > bursts >> > >> > > > >> > > > >> > > > with the following intervals in my logs. 
>> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is on the >> coaster >> > >> > > service >> > >> > > > >> side: >> > >> > > > >> > > the >> > >> > > > >> > > > >> C >> > >> > > > >> > > > >> > > > client is just waiting for a response. >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted through the >> > >> local job >> > >> > > > >> > > manager, so >> > >> > > > >> > > > >> I >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. The tasks >> are >> > >> also >> > >> > > just >> > >> > > > >> > > > >> > > "/bin/hostname", >> > >> > > > >> > > > >> > > > so should return immediately. >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into this on my >> > >> own, but >> > >> > > the >> > >> > > > >> 2 >> > >> > > > >> > > minute >> > >> > > > >> > > > >> delay >> > >> > > > >> > > > >> > > > seems like a big clue: does anyone have an >> idea >> > >> what >> > >> > > could >> > >> > > > >> cause >> > >> > > > >> > > > >> stalls >> > >> > > > >> > > > >> > > in >> > >> > > > >> > > > >> > > > task submission of 2 minute duration? >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > Cheers, >> > >> > > > >> > > > >> > > > Tim >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> > >> > > > >> > > > > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > >> >> > >> > > > >> >> > >> > > > >> >> > >> > > > > >> > >> > > >> > >> > > >> > >> > > >> > >> >> > >> >> > >> >> > > >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: start-coaster-service.log.gz Type: application/x-gzip Size: 1370 bytes Desc: not available URL: From hategan at mcs.anl.gov Thu Sep 11 13:23:31 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 11:23:31 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> Message-ID: <1410459811.27191.5.camel@echo> The coaster logging was broken, and that brokenness caused it to print everything on stdout. That got fixed, so the actual log is now in ./cps*.log. So I probably need that log. Mihael On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote: > I meant the github master., but it turns out that I had had the wrong Swift > on my path. Apologies for the confusion. > > I've rerun with the current one. > > I'm getting a null pointer exception on line 226 of > BlockQueueProcessor.java. Adding some printfs revealed that settings was > null. > > Log attached. 
> > - Tim > > Job: Job(id:0 600.000s) > Settings: null > java.lang.NullPointerException > at > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) > at > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) > at > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) > at > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) > at > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) > at > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) > at > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) > at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong > wrote: > > > I thought I was running the latest trunk, I'll rebuild and see if I can > > reproduce the issue. > > > > - Tim > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan > > wrote: > > > >> The method "getMetaChannel()" has been removed. Where did you get the > >> code from? > >> > >> Mihael > >> > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > >> > I'm seeing failures when running Swift/T tests with > >> > start-coaster-service.sh. > >> > > >> > E.g. the turbine test coaster-exec-1. I can provide instructions for > >> > running the test if needed (roughly, you need to build Swift/T with > >> coaster > >> > support enabled, then make tests/coaster-exec-1.result in the turbine > >> > directory). The github swift-t release is up to date if you want to use > >> > that. > >> > > >> > Full log is attached, stack trace excerpt is below. > >> > > >> > - Tim > >> > > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
> >> > id=0911-1112130 > >> > Using threaded sender for TCPChannel [type: server, contact: > >> 127.0.0.1:48242 > >> > ] > >> > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel Using > >> > threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242] > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > >> > @id://null-nullS > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > >> > at > >> > > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > >> > at > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > >> > provider=local > >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler provider=local > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > >> > @id://null-nullS > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > >> > at > >> > > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > >> > at > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job > >> > description > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job > >> > description > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > >> > at > >> > > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > >> > at > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > >> > channel: null at id://null-nullS > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > >> > at > >> > > 
>> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > >> > ... 4 more > >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, > >> > SUBMITJOB) sending error: Could not deserialize job description > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job > >> > description > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > >> > at > >> > > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > >> > at > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > >> > channel: null at id://null-nullS > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > >> > ... 4 more > >> > > >> > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > >> tim.g.armstrong at gmail.com> > >> > wrote: > >> > > >> > > This all sounds great. > >> > > > >> > > Just to check that I've understood correctly, from the client's point > >> of > >> > > view: > >> > > * The per-client settings behave the same if -shared is not provided. > >> > > * Per-client settings are ignored if -shared is provided > >> > > > >> > > I had one question: > >> > > * Do automatically allocated workers work with per-client settings? I > >> > > understand there were some issues related to sharing workers between > >> > > clients. Was the solution to have separate worker pools, or is this > >> just > >> > > not supported? > >> > > > >> > > - Tim > >> > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan > >> > > wrote: > >> > > > >> > >> So... > >> > >> > >> > >> There were bugs. Lots of bugs. > >> > >> I did some work over the weekend to fix some of these and clean up > >> the > >> > >> coaster code. Here's a summary: > >> > >> > >> > >> - there was some stuff in the low level coaster code to deal with > >> > >> persisting coaster channels over multiple connections with various > >> > >> options, like periodic connections, client or server initiated > >> > >> connections, buffering of commands, etc. None of this was used by > >> Swift, > >> > >> and the code was pretty messy. I removed that. > >> > >> - there were some issues with multiple clients: > >> > >> * improper shutdown of relevant workers when a client disconnected > >> > >> * the worker task dispatcher was a singleton and had a reference to > >> > >> one block allocator, whereas multiple clients involved multiple > >> > >> allocators. 
> >> > >> - there were a bunch of locking issues in the C client that valgrind > >> > >> caught > >> > >> - the idea of remote job ids was a bit hard to work with. This > >> remote id > >> > >> was the job id that the service assigned to a job. This is necessary > >> > >> because two different clients can submit jobs with the same id. The > >> > >> remote id would be communicated to the client as the reply to the > >> submit > >> > >> request. However, it was entirely possible for a notification about > >> job > >> > >> status to be sent to the client before the submit reply was. Since > >> > >> notifications were sent using the remote-id, the client would have no > >> > >> idea what job the notifications belonged to. Now, the server might > >> need > >> > >> a unique job id, but there is no reason why it cannot use the client > >> id > >> > >> when communicating the status to a client. So that's there now. > >> > >> - the way the C client was working, its jobs ended up not going to > >> the > >> > >> workers, but the local queue. The service settings now allow > >> specifying > >> > >> the provider/jobManager/url to be used to start blocks, and jobs are > >> > >> routed appropriately if they do not have the batch job flag set. > >> > >> > >> > >> I also added a shared service mode. We discussed this before. > >> Basically > >> > >> you start the coaster service with "-shared " and > >> > >> all the settings are read from that file. In this case, all clients > >> > >> share the same worker pool, and client settings are ignored. > >> > >> > >> > >> The C client now has a multi-job testing tool which can submit many > >> jobs > >> > >> with the desired level of concurrency. > >> > >> > >> > >> I have tested the C client with both shared and non-shared mode, with > >> > >> various levels of jobs being sent, with either one or two concurrent > >> > >> clients. > >> > >> > >> > >> I haven't tested manual workers. > >> > >> > >> > >> I've also decided that during normal operation (i.e. client connects, > >> > >> submits jobs, shuts down gracefully), there should be no exceptions > >> in > >> > >> the coaster log. I think we should stick to that principle. This was > >> the > >> > >> case last I tested, and we should consider any deviation from that > >> to be > >> > >> a problem. Of course, there are some things for which there is no > >> > >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions > >> are > >> > >> fine in that case. > >> > >> > >> > >> So anyway, let's start from here. > >> > >> > >> > >> Mihael > >> > >> > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > >> > >> > Thanks, let me know if there's anything I can help do. > >> > >> > > >> > >> > - Tim > >> > >> > > >> > >> > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > >> hategan at mcs.anl.gov> > >> > >> wrote: > >> > >> > > >> > >> > > Thanks. It also seems that there is an older bug in there in > >> which the > >> > >> > > client connection is not properly accounted for and things start > >> > >> failing > >> > >> > > two minutes after the client connects (which is also probably > >> why you > >> > >> > > didn't see this in runs with many short client connections). I'm > >> not > >> > >> > > sure why the fix for that bug isn't in the trunk code. > >> > >> > > > >> > >> > > In any event, I'll set up a client submission loop and fix all > >> these > >> > >> > > things. 
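The remote-id change described in Mihael's summary above comes down to a small bookkeeping rule: the service may keep its own unique id for each job, but every status notification it sends back is keyed by the id the client supplied at submit time, so a notification that races ahead of the submit reply is still unambiguous. A minimal sketch of that rule, using hypothetical names rather than the actual coaster classes:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the idea only; not the coaster implementation.
public class JobIdRegistry {
    private final AtomicLong nextServerId = new AtomicLong();
    // server-side id -> the id the client used when submitting
    private final Map<Long, String> clientIds = new ConcurrentHashMap<>();

    /** Called on submit: remember the client's id, return a unique server id. */
    public long register(String clientJobId) {
        long serverId = nextServerId.incrementAndGet();
        clientIds.put(serverId, clientJobId);
        return serverId;
    }

    /** Called on a status change: notify using the client's own id, so the
     *  client can match it even if the submit reply is still in flight. */
    public void notifyStatus(long serverId, String status, Notifier notifier) {
        String clientJobId = clientIds.get(serverId);
        if (clientJobId != null) {
            notifier.send(clientJobId, status);
        }
    }

    public interface Notifier {
        void send(String clientJobId, String status);
    }
}

The essential point is only that notifyStatus() never exposes the server-side id to the client.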
> >> > >> > > > >> > >> > > Mihael > >> > >> > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > >> > >> > > > Ok, here it is with the additional debug messages. Source code > >> > >> change is > >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > >> > >> > > > > >> > >> > > > Warning: the tarball will expand to several gigabytes of logs. > >> > >> > > > > >> > >> > > > I had to do multiple client runs to trigger it. It seems like > >> the > >> > >> > > problem > >> > >> > > > might be triggered by abnormal termination of the client. > >> First 18 > >> > >> runs > >> > >> > > > went fine, problem only started when I ctrl-c-ed the swift/t > >> run #19 > >> > >> > > before > >> > >> > > > the run #20 that exhibited delays. > >> > >> > > > > >> > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > >> > >> > > > > >> > >> > > > - Tim > >> > >> > > > > >> > >> > > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > >> > >> tim.g.armstrong at gmail.com > >> > >> > > > > >> > >> > > > wrote: > >> > >> > > > > >> > >> > > > > It's here: > >> > >> > > > > > >> > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > >> . > >> > >> > > > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ client > >> and > >> > >> see > >> > >> > > if I > >> > >> > > > > can recreate the scenario. > >> > >> > > > > > >> > >> > > > > - Tim > >> > >> > > > > > >> > >> > > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > >> > >> hategan at mcs.anl.gov> > >> > >> > > > > wrote: > >> > >> > > > > > >> > >> > > > >> Ok, so that's legit. > >> > >> > > > >> > >> > >> > > > >> It does look like shut down workers are not properly > >> accounted > >> > >> for in > >> > >> > > > >> some places (and I believe Yadu submitted a bug for this). > >> > >> However, I > >> > >> > > do > >> > >> > > > >> not see the dead time you mention in either of the last two > >> sets > >> > >> of > >> > >> > > > >> logs. It looks like each client instance submits a continous > >> > >> stream of > >> > >> > > > >> jobs. > >> > >> > > > >> > >> > >> > > > >> So let's get back to the initial log. Can I have the full > >> > >> service log? > >> > >> > > > >> I'm trying to track what happened with the jobs submitted > >> before > >> > >> the > >> > >> > > > >> first big pause. > >> > >> > > > >> > >> > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or > >> > >> friends) > >> > >> > > > >> would probably help a lot here. > >> > >> > > > >> > >> > >> > > > >> Mihael > >> > >> > > > >> > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > >> > >> > > > >> > Should be here: > >> > >> > > > >> > > >> > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > >> > >> hategan at mcs.anl.gov > >> > >> > > > > >> > >> > > > >> wrote: > >> > >> > > > >> > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. The log > >> looks > >> > >> funny > >> > >> > > at > >> > >> > > > >> the > >> > >> > > > >> > > end. > >> > >> > > > >> > > > >> > >> > > > >> > > Can you git pull and re-run? The worker is getting some > >> > >> command > >> > >> > > at the > >> > >> > > > >> > > end there and doing nothing about it and I wonder why. 
> >> > >> > > > >> > > > >> > >> > > > >> > > Mihael > >> > >> > > > >> > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > >> > >> > > > >> > > > Ok, now I have some worker logs: > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > >> > >> > > > >> > > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker logs that > >> would > >> > >> > > > >> indicate why > >> > >> > > > >> > > > the connection was broken. > >> > >> > > > >> > > > > >> > >> > > > >> > > > - Tim > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > >> > >> > > > >> tim.g.armstrong at gmail.com > >> > >> > > > >> > > > > >> > >> > > > >> > > > wrote: > >> > >> > > > >> > > > > >> > >> > > > >> > > > > This is all running locally on my laptop, so I > >> think we > >> > >> can > >> > >> > > rule > >> > >> > > > >> out > >> > >> > > > >> > > 1). > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster service > >> gets > >> > >> into > >> > >> > > > >> after a > >> > >> > > > >> > > few > >> > >> > > > >> > > > > client sessions: generally the first coaster run > >> works > >> > >> fine, > >> > >> > > then > >> > >> > > > >> > > after a > >> > >> > > > >> > > > > few runs the problem occurs more frequently. > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, in the > >> meantime > >> > >> i've got > >> > >> > > > >> some > >> > >> > > > >> > > > > jstacks (attached). > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if needed: > >> > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > >> > >> > > > >> hategan at mcs.anl.gov> > >> > >> > > > >> > > > > wrote: > >> > >> > > > >> > > > > > >> > >> > > > >> > > > >> Ah, makes sense. > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live > >> connection > >> > >> is > >> > >> > > > >> guaranteed > >> > >> > > > >> > > to > >> > >> > > > >> > > > >> have some communication for any 2 minute time > >> window, > >> > >> > > partially > >> > >> > > > >> due to > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no > >> > >> packets flow > >> > >> > > > >> for the > >> > >> > > > >> > > > >> duration of 2 minutes, the connection is assumed > >> broken > >> > >> and > >> > >> > > all > >> > >> > > > >> jobs > >> > >> > > > >> > > > >> that were submitted to the respective workers are > >> > >> considered > >> > >> > > > >> failed. > >> > >> > > > >> > > So > >> > >> > > > >> > > > >> there seems to be an issue with the connections to > >> some > >> > >> of > >> > >> > > the > >> > >> > > > >> > > workers, > >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. 
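The numbers in the explanation above -- heartbeats every minute, a connection declared broken after a two-minute silent window, and all jobs on its workers then failed -- are a standard heartbeat/timeout pattern, which is also why a lost worker takes up to two minutes to show up as failed jobs. Below is a self-contained sketch of that pattern with hypothetical names; it is not the coaster channel code, just an illustration:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Heartbeat/timeout sketch (hypothetical names, not the coaster classes).
public class ChannelWatchdog {
    private static final long HEARTBEAT_MS = 60_000;   // heartbeats sent every minute
    private static final long TIMEOUT_MS   = 120_000;  // assume broken after 2 silent minutes

    private final Map<String, Long> lastTraffic = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Periodically look for channels with no traffic in the last 2 minutes.
        timer.scheduleAtFixedRate(this::check, HEARTBEAT_MS, HEARTBEAT_MS,
                                  TimeUnit.MILLISECONDS);
    }

    /** Call on every packet received on a channel, heartbeats included. */
    public void recordTraffic(String channelId) {
        lastTraffic.put(channelId, System.currentTimeMillis());
    }

    private void check() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastTraffic.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                lastTraffic.remove(e.getKey());
                onChannelDead(e.getKey());
            }
        }
    }

    /** In the real service this is where jobs on the lost workers would be failed. */
    protected void onChannelDead(String channelId) {
        System.err.println("channel " + channelId
            + " silent for 2 minutes; assuming it is broken");
    }
}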
> >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> Since the service seems to be alive (although a > >> jstack > >> > >> on the > >> > >> > > > >> service > >> > >> > > > >> > > > >> when thing seem to hang might help), this leaves > >> two > >> > >> > > > >> possibilities: > >> > >> > > > >> > > > >> 1 - some genuine network problem > >> > >> > > > >> > > > >> 2 - the worker died without properly closing TCP > >> > >> connections > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> If (2), you could enable worker logging > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to > >> see > >> > >> if > >> > >> > > > >> anything > >> > >> > > > >> > > shows > >> > >> > > > >> > > > >> up. > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> Mihael > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong > >> wrote: > >> > >> > > > >> > > > >> > Here are client and service logs, with part of > >> > >> service log > >> > >> > > > >> edited > >> > >> > > > >> > > down > >> > >> > > > >> > > > >> to > >> > >> > > > >> > > > >> > be a reasonable size (I have the full thing if > >> > >> needed, but > >> > >> > > it > >> > >> > > > >> was > >> > >> > > > >> > > over a > >> > >> > > > >> > > > >> > gigabyte). > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. > >> The > >> > >> client > >> > >> > > > >> submits 4 > >> > >> > > > >> > > > >> jobs > >> > >> > > > >> > > > >> > (its limit), but they don't complete until > >> 19:51:32 > >> > >> or so > >> > >> > > (I > >> > >> > > > >> can see > >> > >> > > > >> > > > >> that > >> > >> > > > >> > > > >> > one task completed based on ncompleted=1 in the > >> > >> > > check_tasks log > >> > >> > > > >> > > > >> message). > >> > >> > > > >> > > > >> > It looks like something has happened with broken > >> > >> pipes and > >> > >> > > > >> workers > >> > >> > > > >> > > being > >> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of > >> > >> that is > >> > >> > > > >> likely to > >> > >> > > > >> > > be. > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > - Tim > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > >> > >> > > > >> hategan at mcs.anl.gov > >> > >> > > > >> > > > > >> > >> > > > >> > > > >> wrote: > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > Hi Tim, > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > I've never seen this before with pure Java. > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > Mihael > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim > >> Armstrong > >> > >> wrote: > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit > >> > >> tasks to > >> > >> > > > >> Coasters > >> > >> > > > >> > > > >> through > >> > >> > > > >> > > > >> > > the > >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour > >> > >> where task > >> > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 > >> minute > >> > >> periods. 
> >> > >> > > > >> For > >> > >> > > > >> > > > >> example, I'm > >> > >> > > > >> > > > >> > > > seeing submit log messages like "submitting > >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > >> > >> /bin/hostname" in > >> > >> > > > >> bursts of > >> > >> > > > >> > > > >> several > >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in > >> > >> between, > >> > >> > > e.g. > >> > >> > > > >> I'm > >> > >> > > > >> > > seeing > >> > >> > > > >> > > > >> > > bursts > >> > >> > > > >> > > > >> > > > with the following intervals in my logs. > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is on the > >> coaster > >> > >> > > service > >> > >> > > > >> side: > >> > >> > > > >> > > the > >> > >> > > > >> > > > >> C > >> > >> > > > >> > > > >> > > > client is just waiting for a response. > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted through the > >> > >> local job > >> > >> > > > >> > > manager, so > >> > >> > > > >> > > > >> I > >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. The tasks > >> are > >> > >> also > >> > >> > > just > >> > >> > > > >> > > > >> > > "/bin/hostname", > >> > >> > > > >> > > > >> > > > so should return immediately. > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into this on my > >> > >> own, but > >> > >> > > the > >> > >> > > > >> 2 > >> > >> > > > >> > > minute > >> > >> > > > >> > > > >> delay > >> > >> > > > >> > > > >> > > > seems like a big clue: does anyone have an > >> idea > >> > >> what > >> > >> > > could > >> > >> > > > >> cause > >> > >> > > > >> > > > >> stalls > >> > >> > > > >> > > > >> > > in > >> > >> > > > >> > > > >> > > > task submission of 2 minute duration? > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > Cheers, > >> > >> > > > >> > > > >> > > > Tim > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > >> > >> > > > >> > >> > >> > > > >> > >> > >> > > > > > >> > >> > > > >> > >> > > > >> > >> > > > >> > >> > >> > >> > >> > >> > >> > > > >> > >> > >> > > From tim.g.armstrong at gmail.com Thu Sep 11 13:26:52 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 13:26:52 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410459811.27191.5.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> Message-ID: Oops, forgot about that On Thu, Sep 11, 2014 at 1:23 PM, Mihael Hategan wrote: > The coaster logging was broken, and that brokenness caused it to print > everything on stdout. That got fixed, so the actual log is now > in ./cps*.log. > > So I probably need that log. 
> > Mihael > > On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote: > > I meant the github master., but it turns out that I had had the wrong > Swift > > on my path. Apologies for the confusion. > > > > I've rerun with the current one. > > > > I'm getting a null pointer exception on line 226 of > > BlockQueueProcessor.java. Adding some printfs revealed that settings was > > null. > > > > Log attached. > > > > - Tim > > > > Job: Job(id:0 600.000s) > > Settings: null > > java.lang.NullPointerException > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) > > at > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) > > at > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) > > at > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) > > at > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) > > at > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) > > > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong < > tim.g.armstrong at gmail.com> > > wrote: > > > > > I thought I was running the latest trunk, I'll rebuild and see if I can > > > reproduce the issue. > > > > > > - Tim > > > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan > > > wrote: > > > > > >> The method "getMetaChannel()" has been removed. Where did you get the > > >> code from? > > >> > > >> Mihael > > >> > > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > >> > I'm seeing failures when running Swift/T tests with > > >> > start-coaster-service.sh. > > >> > > > >> > E.g. the turbine test coaster-exec-1. I can provide instructions > for > > >> > running the test if needed (roughly, you need to build Swift/T with > > >> coaster > > >> > support enabled, then make tests/coaster-exec-1.result in the > turbine > > >> > directory). The github swift-t release is up to date if you want > to use > > >> > that. > > >> > > > >> > Full log is attached, stack trace excerpt is below. > > >> > > > >> > - Tim > > >> > > > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
> > >> > id=0911-1112130 > > >> > Using threaded sender for TCPChannel [type: server, contact: > > >> 127.0.0.1:48242 > > >> > ] > > >> > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel > Using > > >> > threaded sender for TCPChannel [type: server, contact: > 127.0.0.1:48242] > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: > null > > >> > @id://null-nullS > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > >> > at > > >> > > > >> > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > >> > at > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > >> > provider=local > > >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler > provider=local > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: > null > > >> > @id://null-nullS > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > >> > at > > >> > > > >> > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > >> > at > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize > job > > >> > description > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > >> > description > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > >> > at > > >> > > > >> > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > >> > at > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > >> > Caused by: 
org.globus.cog.coaster.channels.ChannelException: Invalid > > >> > channel: null at id://null-nullS > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > >> > ... 4 more > > >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, > > >> > SUBMITJOB) sending error: Could not deserialize job description > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > >> > description > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > >> > at > > >> > > > >> > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > >> > at > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > >> > channel: null at id://null-nullS > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > >> > ... 4 more > > >> > > > >> > > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > > >> tim.g.armstrong at gmail.com> > > >> > wrote: > > >> > > > >> > > This all sounds great. > > >> > > > > >> > > Just to check that I've understood correctly, from the client's > point > > >> of > > >> > > view: > > >> > > * The per-client settings behave the same if -shared is not > provided. > > >> > > * Per-client settings are ignored if -shared is provided > > >> > > > > >> > > I had one question: > > >> > > * Do automatically allocated workers work with per-client > settings? I > > >> > > understand there were some issues related to sharing workers > between > > >> > > clients. Was the solution to have separate worker pools, or is > this > > >> just > > >> > > not supported? > > >> > > > > >> > > - Tim > > >> > > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan < > hategan at mcs.anl.gov> > > >> > > wrote: > > >> > > > > >> > >> So... > > >> > >> > > >> > >> There were bugs. Lots of bugs. > > >> > >> I did some work over the weekend to fix some of these and clean > up > > >> the > > >> > >> coaster code. Here's a summary: > > >> > >> > > >> > >> - there was some stuff in the low level coaster code to deal with > > >> > >> persisting coaster channels over multiple connections with > various > > >> > >> options, like periodic connections, client or server initiated > > >> > >> connections, buffering of commands, etc. 
None of this was used by > > >> Swift, > > >> > >> and the code was pretty messy. I removed that. > > >> > >> - there were some issues with multiple clients: > > >> > >> * improper shutdown of relevant workers when a client > disconnected > > >> > >> * the worker task dispatcher was a singleton and had a > reference to > > >> > >> one block allocator, whereas multiple clients involved multiple > > >> > >> allocators. > > >> > >> - there were a bunch of locking issues in the C client that > valgrind > > >> > >> caught > > >> > >> - the idea of remote job ids was a bit hard to work with. This > > >> remote id > > >> > >> was the job id that the service assigned to a job. This is > necessary > > >> > >> because two different clients can submit jobs with the same id. > The > > >> > >> remote id would be communicated to the client as the reply to the > > >> submit > > >> > >> request. However, it was entirely possible for a notification > about > > >> job > > >> > >> status to be sent to the client before the submit reply was. > Since > > >> > >> notifications were sent using the remote-id, the client would > have no > > >> > >> idea what job the notifications belonged to. Now, the server > might > > >> need > > >> > >> a unique job id, but there is no reason why it cannot use the > client > > >> id > > >> > >> when communicating the status to a client. So that's there now. > > >> > >> - the way the C client was working, its jobs ended up not going > to > > >> the > > >> > >> workers, but the local queue. The service settings now allow > > >> specifying > > >> > >> the provider/jobManager/url to be used to start blocks, and jobs > are > > >> > >> routed appropriately if they do not have the batch job flag set. > > >> > >> > > >> > >> I also added a shared service mode. We discussed this before. > > >> Basically > > >> > >> you start the coaster service with "-shared > " and > > >> > >> all the settings are read from that file. In this case, all > clients > > >> > >> share the same worker pool, and client settings are ignored. > > >> > >> > > >> > >> The C client now has a multi-job testing tool which can submit > many > > >> jobs > > >> > >> with the desired level of concurrency. > > >> > >> > > >> > >> I have tested the C client with both shared and non-shared mode, > with > > >> > >> various levels of jobs being sent, with either one or two > concurrent > > >> > >> clients. > > >> > >> > > >> > >> I haven't tested manual workers. > > >> > >> > > >> > >> I've also decided that during normal operation (i.e. client > connects, > > >> > >> submits jobs, shuts down gracefully), there should be no > exceptions > > >> in > > >> > >> the coaster log. I think we should stick to that principle. This > was > > >> the > > >> > >> case last I tested, and we should consider any deviation from > that > > >> to be > > >> > >> a problem. Of course, there are some things for which there is no > > >> > >> graceful shut down, such as ctrl+C-ing a manual worker. > Exceptions > > >> are > > >> > >> fine in that case. > > >> > >> > > >> > >> So anyway, let's start from here. > > >> > >> > > >> > >> Mihael > > >> > >> > > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > >> > >> > Thanks, let me know if there's anything I can help do. > > >> > >> > > > >> > >> > - Tim > > >> > >> > > > >> > >> > > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > > >> hategan at mcs.anl.gov> > > >> > >> wrote: > > >> > >> > > > >> > >> > > Thanks. 
It also seems that there is an older bug in there in > > >> which the > > >> > >> > > client connection is not properly accounted for and things > start > > >> > >> failing > > >> > >> > > two minutes after the client connects (which is also probably > > >> why you > > >> > >> > > didn't see this in runs with many short client connections). > I'm > > >> not > > >> > >> > > sure why the fix for that bug isn't in the trunk code. > > >> > >> > > > > >> > >> > > In any event, I'll set up a client submission loop and fix > all > > >> these > > >> > >> > > things. > > >> > >> > > > > >> > >> > > Mihael > > >> > >> > > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > >> > >> > > > Ok, here it is with the additional debug messages. Source > code > > >> > >> change is > > >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > >> > >> > > > > > >> > >> > > > Warning: the tarball will expand to several gigabytes of > logs. > > >> > >> > > > > > >> > >> > > > I had to do multiple client runs to trigger it. It seems > like > > >> the > > >> > >> > > problem > > >> > >> > > > might be triggered by abnormal termination of the client. > > >> First 18 > > >> > >> runs > > >> > >> > > > went fine, problem only started when I ctrl-c-ed the > swift/t > > >> run #19 > > >> > >> > > before > > >> > >> > > > the run #20 that exhibited delays. > > >> > >> > > > > > >> > >> > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > >> > >> > > > > > >> > >> > > > - Tim > > >> > >> > > > > > >> > >> > > > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > >> > >> tim.g.armstrong at gmail.com > > >> > >> > > > > > >> > >> > > > wrote: > > >> > >> > > > > > >> > >> > > > > It's here: > > >> > >> > > > > > > >> > >> > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > > >> . > > >> > >> > > > > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ > client > > >> and > > >> > >> see > > >> > >> > > if I > > >> > >> > > > > can recreate the scenario. > > >> > >> > > > > > > >> > >> > > > > - Tim > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > >> > >> hategan at mcs.anl.gov> > > >> > >> > > > > wrote: > > >> > >> > > > > > > >> > >> > > > >> Ok, so that's legit. > > >> > >> > > > >> > > >> > >> > > > >> It does look like shut down workers are not properly > > >> accounted > > >> > >> for in > > >> > >> > > > >> some places (and I believe Yadu submitted a bug for > this). > > >> > >> However, I > > >> > >> > > do > > >> > >> > > > >> not see the dead time you mention in either of the last > two > > >> sets > > >> > >> of > > >> > >> > > > >> logs. It looks like each client instance submits a > continous > > >> > >> stream of > > >> > >> > > > >> jobs. > > >> > >> > > > >> > > >> > >> > > > >> So let's get back to the initial log. Can I have the > full > > >> > >> service log? > > >> > >> > > > >> I'm trying to track what happened with the jobs > submitted > > >> before > > >> > >> the > > >> > >> > > > >> first big pause. > > >> > >> > > > >> > > >> > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() > (or > > >> > >> friends) > > >> > >> > > > >> would probably help a lot here. 
> > >> > >> > > > >> > > >> > >> > > > >> Mihael > > >> > >> > > > >> > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > >> > >> > > > >> > Should be here: > > >> > >> > > > >> > > > >> > >> > > > >> > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > > >> > >> hategan at mcs.anl.gov > > >> > >> > > > > > >> > >> > > > >> wrote: > > >> > >> > > > >> > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. The log > > >> looks > > >> > >> funny > > >> > >> > > at > > >> > >> > > > >> the > > >> > >> > > > >> > > end. > > >> > >> > > > >> > > > > >> > >> > > > >> > > Can you git pull and re-run? The worker is getting > some > > >> > >> command > > >> > >> > > at the > > >> > >> > > > >> > > end there and doing nothing about it and I wonder > why. > > >> > >> > > > >> > > > > >> > >> > > > >> > > Mihael > > >> > >> > > > >> > > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong > wrote: > > >> > >> > > > >> > > > Ok, now I have some worker logs: > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker logs > that > > >> would > > >> > >> > > > >> indicate why > > >> > >> > > > >> > > > the connection was broken. > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > - Tim > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > >> > >> > > > >> tim.g.armstrong at gmail.com > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > wrote: > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > This is all running locally on my laptop, so I > > >> think we > > >> > >> can > > >> > >> > > rule > > >> > >> > > > >> out > > >> > >> > > > >> > > 1). > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster > service > > >> gets > > >> > >> into > > >> > >> > > > >> after a > > >> > >> > > > >> > > few > > >> > >> > > > >> > > > > client sessions: generally the first coaster run > > >> works > > >> > >> fine, > > >> > >> > > then > > >> > >> > > > >> > > after a > > >> > >> > > > >> > > > > few runs the problem occurs more frequently. > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, in the > > >> meantime > > >> > >> i've got > > >> > >> > > > >> some > > >> > >> > > > >> > > > > jstacks (attached). > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if > needed: > > >> > >> > > > >> > > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan > < > > >> > >> > > > >> hategan at mcs.anl.gov> > > >> > >> > > > >> > > > > wrote: > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > >> Ah, makes sense. > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. 
Each live > > >> connection > > >> > >> is > > >> > >> > > > >> guaranteed > > >> > >> > > > >> > > to > > >> > >> > > > >> > > > >> have some communication for any 2 minute time > > >> window, > > >> > >> > > partially > > >> > >> > > > >> due to > > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If > no > > >> > >> packets flow > > >> > >> > > > >> for the > > >> > >> > > > >> > > > >> duration of 2 minutes, the connection is > assumed > > >> broken > > >> > >> and > > >> > >> > > all > > >> > >> > > > >> jobs > > >> > >> > > > >> > > > >> that were submitted to the respective workers > are > > >> > >> considered > > >> > >> > > > >> failed. > > >> > >> > > > >> > > So > > >> > >> > > > >> > > > >> there seems to be an issue with the > connections to > > >> some > > >> > >> of > > >> > >> > > the > > >> > >> > > > >> > > workers, > > >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> Since the service seems to be alive (although a > > >> jstack > > >> > >> on the > > >> > >> > > > >> service > > >> > >> > > > >> > > > >> when thing seem to hang might help), this > leaves > > >> two > > >> > >> > > > >> possibilities: > > >> > >> > > > >> > > > >> 1 - some genuine network problem > > >> > >> > > > >> > > > >> 2 - the worker died without properly closing > TCP > > >> > >> connections > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> If (2), you could enable worker logging > > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = > "DEBUG") to > > >> see > > >> > >> if > > >> > >> > > > >> anything > > >> > >> > > > >> > > shows > > >> > >> > > > >> > > > >> up. > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> Mihael > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim > Armstrong > > >> wrote: > > >> > >> > > > >> > > > >> > Here are client and service logs, with part > of > > >> > >> service log > > >> > >> > > > >> edited > > >> > >> > > > >> > > down > > >> > >> > > > >> > > > >> to > > >> > >> > > > >> > > > >> > be a reasonable size (I have the full thing > if > > >> > >> needed, but > > >> > >> > > it > > >> > >> > > > >> was > > >> > >> > > > >> > > over a > > >> > >> > > > >> > > > >> > gigabyte). > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 > onwards. > > >> The > > >> > >> client > > >> > >> > > > >> submits 4 > > >> > >> > > > >> > > > >> jobs > > >> > >> > > > >> > > > >> > (its limit), but they don't complete until > > >> 19:51:32 > > >> > >> or so > > >> > >> > > (I > > >> > >> > > > >> can see > > >> > >> > > > >> > > > >> that > > >> > >> > > > >> > > > >> > one task completed based on ncompleted=1 in > the > > >> > >> > > check_tasks log > > >> > >> > > > >> > > > >> message). > > >> > >> > > > >> > > > >> > It looks like something has happened with > broken > > >> > >> pipes and > > >> > >> > > > >> workers > > >> > >> > > > >> > > being > > >> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate > cause of > > >> > >> that is > > >> > >> > > > >> likely to > > >> > >> > > > >> > > be. 
> > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > - Tim > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael > Hategan < > > >> > >> > > > >> hategan at mcs.anl.gov > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > >> wrote: > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > Hi Tim, > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > I've never seen this before with pure Java. > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > Mihael > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim > > >> Armstrong > > >> > >> wrote: > > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script that > submit > > >> > >> tasks to > > >> > >> > > > >> Coasters > > >> > >> > > > >> > > > >> through > > >> > >> > > > >> > > > >> > > the > > >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd > behaviour > > >> > >> where task > > >> > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 > > >> minute > > >> > >> periods. > > >> > >> > > > >> For > > >> > >> > > > >> > > > >> example, I'm > > >> > >> > > > >> > > > >> > > > seeing submit log messages like > "submitting > > >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > > >> > >> /bin/hostname" in > > >> > >> > > > >> bursts of > > >> > >> > > > >> > > > >> several > > >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes > in > > >> > >> between, > > >> > >> > > e.g. > > >> > >> > > > >> I'm > > >> > >> > > > >> > > seeing > > >> > >> > > > >> > > > >> > > bursts > > >> > >> > > > >> > > > >> > > > with the following intervals in my logs. > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is on the > > >> coaster > > >> > >> > > service > > >> > >> > > > >> side: > > >> > >> > > > >> > > the > > >> > >> > > > >> > > > >> C > > >> > >> > > > >> > > > >> > > > client is just waiting for a response. > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted > through the > > >> > >> local job > > >> > >> > > > >> > > manager, so > > >> > >> > > > >> > > > >> I > > >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. The > tasks > > >> are > > >> > >> also > > >> > >> > > just > > >> > >> > > > >> > > > >> > > "/bin/hostname", > > >> > >> > > > >> > > > >> > > > so should return immediately. 
> > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into this > on my > > >> > >> own, but > > >> > >> > > the > > >> > >> > > > >> 2 > > >> > >> > > > >> > > minute > > >> > >> > > > >> > > > >> delay > > >> > >> > > > >> > > > >> > > > seems like a big clue: does anyone have > an > > >> idea > > >> > >> what > > >> > >> > > could > > >> > >> > > > >> cause > > >> > >> > > > >> > > > >> stalls > > >> > >> > > > >> > > > >> > > in > > >> > >> > > > >> > > > >> > > > task submission of 2 minute duration? > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > Cheers, > > >> > >> > > > >> > > > >> > > > Tim > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > > > > >> > >> > > > > >> > >> > > > > >> > >> > > > > >> > >> > > >> > >> > > >> > >> > > >> > > > > >> > > >> > > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: cps-2014-09-11_13-09-49.log.gz Type: application/x-gzip Size: 3906 bytes Desc: not available URL: From hategan at mcs.anl.gov Thu Sep 11 13:44:28 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 11:44:28 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> Message-ID: <1410461068.27191.7.camel@echo> Passive workers? On Thu, 2014-09-11 at 13:26 -0500, Tim Armstrong wrote: > Oops, forgot about that > > > > On Thu, Sep 11, 2014 at 1:23 PM, Mihael Hategan wrote: > > > The coaster logging was broken, and that brokenness caused it to print > > everything on stdout. That got fixed, so the actual log is now > > in ./cps*.log. > > > > So I probably need that log. > > > > Mihael > > > > On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote: > > > I meant the github master., but it turns out that I had had the wrong > > Swift > > > on my path. Apologies for the confusion. > > > > > > I've rerun with the current one. > > > > > > I'm getting a null pointer exception on line 226 of > > > BlockQueueProcessor.java. Adding some printfs revealed that settings was > > > null. > > > > > > Log attached. 
> > > > > > - Tim > > > > > > Job: Job(id:0 600.000s) > > > Settings: null > > > java.lang.NullPointerException > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) > > > at > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) > > > at > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) > > > at > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) > > > at > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) > > > > > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong < > > tim.g.armstrong at gmail.com> > > > wrote: > > > > > > > I thought I was running the latest trunk, I'll rebuild and see if I can > > > > reproduce the issue. > > > > > > > > - Tim > > > > > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan > > > > wrote: > > > > > > > >> The method "getMetaChannel()" has been removed. Where did you get the > > > >> code from? > > > >> > > > >> Mihael > > > >> > > > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > > >> > I'm seeing failures when running Swift/T tests with > > > >> > start-coaster-service.sh. > > > >> > > > > >> > E.g. the turbine test coaster-exec-1. I can provide instructions > > for > > > >> > running the test if needed (roughly, you need to build Swift/T with > > > >> coaster > > > >> > support enabled, then make tests/coaster-exec-1.result in the > > turbine > > > >> > directory). The github swift-t release is up to date if you want > > to use > > > >> > that. > > > >> > > > > >> > Full log is attached, stack trace excerpt is below. > > > >> > > > > >> > - Tim > > > >> > > > > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
> > > >> > id=0911-1112130 > > > >> > Using threaded sender for TCPChannel [type: server, contact: > > > >> 127.0.0.1:48242 > > > >> > ] > > > >> > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel > > Using > > > >> > threaded sender for TCPChannel [type: server, contact: > > 127.0.0.1:48242] > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: > > null > > > >> > @id://null-nullS > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > >> > at > > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > >> > provider=local > > > >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler > > provider=local > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: > > null > > > >> > @id://null-nullS > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > >> > at > > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize > > job > > > >> > description > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > > >> > description > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > >> > at > > > >> > > > > >> > > 
org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > >> > at > > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > > >> > channel: null at id://null-nullS > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > >> > ... 4 more > > > >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, > > > >> > SUBMITJOB) sending error: Could not deserialize job description > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > > >> > description > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > >> > at > > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > > >> > channel: null at id://null-nullS > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > >> > ... 4 more > > > >> > > > > >> > > > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > > > >> tim.g.armstrong at gmail.com> > > > >> > wrote: > > > >> > > > > >> > > This all sounds great. > > > >> > > > > > >> > > Just to check that I've understood correctly, from the client's > > point > > > >> of > > > >> > > view: > > > >> > > * The per-client settings behave the same if -shared is not > > provided. > > > >> > > * Per-client settings are ignored if -shared is provided > > > >> > > > > > >> > > I had one question: > > > >> > > * Do automatically allocated workers work with per-client > > settings? I > > > >> > > understand there were some issues related to sharing workers > > between > > > >> > > clients. Was the solution to have separate worker pools, or is > > this > > > >> just > > > >> > > not supported? > > > >> > > > > > >> > > - Tim > > > >> > > > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan < > > hategan at mcs.anl.gov> > > > >> > > wrote: > > > >> > > > > > >> > >> So... > > > >> > >> > > > >> > >> There were bugs. Lots of bugs. 
> > > >> > >> I did some work over the weekend to fix some of these and clean > > up > > > >> the > > > >> > >> coaster code. Here's a summary: > > > >> > >> > > > >> > >> - there was some stuff in the low level coaster code to deal with > > > >> > >> persisting coaster channels over multiple connections with > > various > > > >> > >> options, like periodic connections, client or server initiated > > > >> > >> connections, buffering of commands, etc. None of this was used by > > > >> Swift, > > > >> > >> and the code was pretty messy. I removed that. > > > >> > >> - there were some issues with multiple clients: > > > >> > >> * improper shutdown of relevant workers when a client > > disconnected > > > >> > >> * the worker task dispatcher was a singleton and had a > > reference to > > > >> > >> one block allocator, whereas multiple clients involved multiple > > > >> > >> allocators. > > > >> > >> - there were a bunch of locking issues in the C client that > > valgrind > > > >> > >> caught > > > >> > >> - the idea of remote job ids was a bit hard to work with. This > > > >> remote id > > > >> > >> was the job id that the service assigned to a job. This is > > necessary > > > >> > >> because two different clients can submit jobs with the same id. > > The > > > >> > >> remote id would be communicated to the client as the reply to the > > > >> submit > > > >> > >> request. However, it was entirely possible for a notification > > about > > > >> job > > > >> > >> status to be sent to the client before the submit reply was. > > Since > > > >> > >> notifications were sent using the remote-id, the client would > > have no > > > >> > >> idea what job the notifications belonged to. Now, the server > > might > > > >> need > > > >> > >> a unique job id, but there is no reason why it cannot use the > > client > > > >> id > > > >> > >> when communicating the status to a client. So that's there now. > > > >> > >> - the way the C client was working, its jobs ended up not going > > to > > > >> the > > > >> > >> workers, but the local queue. The service settings now allow > > > >> specifying > > > >> > >> the provider/jobManager/url to be used to start blocks, and jobs > > are > > > >> > >> routed appropriately if they do not have the batch job flag set. > > > >> > >> > > > >> > >> I also added a shared service mode. We discussed this before. > > > >> Basically > > > >> > >> you start the coaster service with "-shared > > " and > > > >> > >> all the settings are read from that file. In this case, all > > clients > > > >> > >> share the same worker pool, and client settings are ignored. > > > >> > >> > > > >> > >> The C client now has a multi-job testing tool which can submit > > many > > > >> jobs > > > >> > >> with the desired level of concurrency. > > > >> > >> > > > >> > >> I have tested the C client with both shared and non-shared mode, > > with > > > >> > >> various levels of jobs being sent, with either one or two > > concurrent > > > >> > >> clients. > > > >> > >> > > > >> > >> I haven't tested manual workers. > > > >> > >> > > > >> > >> I've also decided that during normal operation (i.e. client > > connects, > > > >> > >> submits jobs, shuts down gracefully), there should be no > > exceptions > > > >> in > > > >> > >> the coaster log. I think we should stick to that principle. This > > was > > > >> the > > > >> > >> case last I tested, and we should consider any deviation from > > that > > > >> to be > > > >> > >> a problem. 
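
A minimal sketch of the client-id scheme described in the summary quoted above (hypothetical names, not the actual cog/coaster classes): because the client records each job under its own id before the submit request goes out, a status notification keyed by that id can always be matched, even if it overtakes the submit reply -- which is exactly the race the server-assigned remote-id scheme ran into.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical client-side registry; not the actual cog/coaster classes.
class ClientJobRegistry {
    private final Map<String, String> statusByClientId = new ConcurrentHashMap<>();

    // Register the job under its client-assigned id before the submit
    // request is sent, so the id is known locally before any reply or
    // notification can possibly arrive.
    void registerSubmitted(String clientJobId) {
        statusByClientId.put(clientJobId, "SUBMITTED");
    }

    // Notifications keyed by the client id (as described in the summary
    // above) always match an entry; a server-assigned "remote id" could
    // not be resolved until the submit reply had been processed.
    void onStatusNotification(String clientJobId, String status) {
        statusByClientId.replace(clientJobId, status);
    }

    String statusOf(String clientJobId) {
        return statusByClientId.get(clientJobId);
    }
}
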
Of course, there are some things for which there is no > > > >> > >> graceful shut down, such as ctrl+C-ing a manual worker. > > Exceptions > > > >> are > > > >> > >> fine in that case. > > > >> > >> > > > >> > >> So anyway, let's start from here. > > > >> > >> > > > >> > >> Mihael > > > >> > >> > > > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > > >> > >> > Thanks, let me know if there's anything I can help do. > > > >> > >> > > > > >> > >> > - Tim > > > >> > >> > > > > >> > >> > > > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > > > >> hategan at mcs.anl.gov> > > > >> > >> wrote: > > > >> > >> > > > > >> > >> > > Thanks. It also seems that there is an older bug in there in > > > >> which the > > > >> > >> > > client connection is not properly accounted for and things > > start > > > >> > >> failing > > > >> > >> > > two minutes after the client connects (which is also probably > > > >> why you > > > >> > >> > > didn't see this in runs with many short client connections). > > I'm > > > >> not > > > >> > >> > > sure why the fix for that bug isn't in the trunk code. > > > >> > >> > > > > > >> > >> > > In any event, I'll set up a client submission loop and fix > > all > > > >> these > > > >> > >> > > things. > > > >> > >> > > > > > >> > >> > > Mihael > > > >> > >> > > > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > > >> > >> > > > Ok, here it is with the additional debug messages. Source > > code > > > >> > >> change is > > > >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > >> > >> > > > > > > >> > >> > > > Warning: the tarball will expand to several gigabytes of > > logs. > > > >> > >> > > > > > > >> > >> > > > I had to do multiple client runs to trigger it. It seems > > like > > > >> the > > > >> > >> > > problem > > > >> > >> > > > might be triggered by abnormal termination of the client. > > > >> First 18 > > > >> > >> runs > > > >> > >> > > > went fine, problem only started when I ctrl-c-ed the > > swift/t > > > >> run #19 > > > >> > >> > > before > > > >> > >> > > > the run #20 that exhibited delays. > > > >> > >> > > > > > > >> > >> > > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > >> > >> > > > > > > >> > >> > > > - Tim > > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > > >> > >> tim.g.armstrong at gmail.com > > > >> > >> > > > > > > >> > >> > > > wrote: > > > >> > >> > > > > > > >> > >> > > > > It's here: > > > >> > >> > > > > > > > >> > >> > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > > > >> . > > > >> > >> > > > > > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ > > client > > > >> and > > > >> > >> see > > > >> > >> > > if I > > > >> > >> > > > > can recreate the scenario. > > > >> > >> > > > > > > > >> > >> > > > > - Tim > > > >> > >> > > > > > > > >> > >> > > > > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > > >> > >> hategan at mcs.anl.gov> > > > >> > >> > > > > wrote: > > > >> > >> > > > > > > > >> > >> > > > >> Ok, so that's legit. > > > >> > >> > > > >> > > > >> > >> > > > >> It does look like shut down workers are not properly > > > >> accounted > > > >> > >> for in > > > >> > >> > > > >> some places (and I believe Yadu submitted a bug for > > this). 
> > > >> > >> However, I > > > >> > >> > > do > > > >> > >> > > > >> not see the dead time you mention in either of the last > > two > > > >> sets > > > >> > >> of > > > >> > >> > > > >> logs. It looks like each client instance submits a > > continous > > > >> > >> stream of > > > >> > >> > > > >> jobs. > > > >> > >> > > > >> > > > >> > >> > > > >> So let's get back to the initial log. Can I have the > > full > > > >> > >> service log? > > > >> > >> > > > >> I'm trying to track what happened with the jobs > > submitted > > > >> before > > > >> > >> the > > > >> > >> > > > >> first big pause. > > > >> > >> > > > >> > > > >> > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() > > (or > > > >> > >> friends) > > > >> > >> > > > >> would probably help a lot here. > > > >> > >> > > > >> > > > >> > >> > > > >> Mihael > > > >> > >> > > > >> > > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > > >> > >> > > > >> > Should be here: > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > > > >> > >> hategan at mcs.anl.gov > > > >> > >> > > > > > > >> > >> > > > >> wrote: > > > >> > >> > > > >> > > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. The log > > > >> looks > > > >> > >> funny > > > >> > >> > > at > > > >> > >> > > > >> the > > > >> > >> > > > >> > > end. > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > Can you git pull and re-run? The worker is getting > > some > > > >> > >> command > > > >> > >> > > at the > > > >> > >> > > > >> > > end there and doing nothing about it and I wonder > > why. > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > Mihael > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong > > wrote: > > > >> > >> > > > >> > > > Ok, now I have some worker logs: > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker logs > > that > > > >> would > > > >> > >> > > > >> indicate why > > > >> > >> > > > >> > > > the connection was broken. > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > - Tim > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > > >> > >> > > > >> tim.g.armstrong at gmail.com > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > wrote: > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > This is all running locally on my laptop, so I > > > >> think we > > > >> > >> can > > > >> > >> > > rule > > > >> > >> > > > >> out > > > >> > >> > > > >> > > 1). > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster > > service > > > >> gets > > > >> > >> into > > > >> > >> > > > >> after a > > > >> > >> > > > >> > > few > > > >> > >> > > > >> > > > > client sessions: generally the first coaster run > > > >> works > > > >> > >> fine, > > > >> > >> > > then > > > >> > >> > > > >> > > after a > > > >> > >> > > > >> > > > > few runs the problem occurs more frequently. 
> > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, in the > > > >> meantime > > > >> > >> i've got > > > >> > >> > > > >> some > > > >> > >> > > > >> > > > > jstacks (attached). > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if > > needed: > > > >> > >> > > > >> > > > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan > > < > > > >> > >> > > > >> hategan at mcs.anl.gov> > > > >> > >> > > > >> > > > > wrote: > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> Ah, makes sense. > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live > > > >> connection > > > >> > >> is > > > >> > >> > > > >> guaranteed > > > >> > >> > > > >> > > to > > > >> > >> > > > >> > > > >> have some communication for any 2 minute time > > > >> window, > > > >> > >> > > partially > > > >> > >> > > > >> due to > > > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If > > no > > > >> > >> packets flow > > > >> > >> > > > >> for the > > > >> > >> > > > >> > > > >> duration of 2 minutes, the connection is > > assumed > > > >> broken > > > >> > >> and > > > >> > >> > > all > > > >> > >> > > > >> jobs > > > >> > >> > > > >> > > > >> that were submitted to the respective workers > > are > > > >> > >> considered > > > >> > >> > > > >> failed. > > > >> > >> > > > >> > > So > > > >> > >> > > > >> > > > >> there seems to be an issue with the > > connections to > > > >> some > > > >> > >> of > > > >> > >> > > the > > > >> > >> > > > >> > > workers, > > > >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> Since the service seems to be alive (although a > > > >> jstack > > > >> > >> on the > > > >> > >> > > > >> service > > > >> > >> > > > >> > > > >> when thing seem to hang might help), this > > leaves > > > >> two > > > >> > >> > > > >> possibilities: > > > >> > >> > > > >> > > > >> 1 - some genuine network problem > > > >> > >> > > > >> > > > >> 2 - the worker died without properly closing > > TCP > > > >> > >> connections > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> If (2), you could enable worker logging > > > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = > > "DEBUG") to > > > >> see > > > >> > >> if > > > >> > >> > > > >> anything > > > >> > >> > > > >> > > shows > > > >> > >> > > > >> > > > >> up. > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> Mihael > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim > > Armstrong > > > >> wrote: > > > >> > >> > > > >> > > > >> > Here are client and service logs, with part > > of > > > >> > >> service log > > > >> > >> > > > >> edited > > > >> > >> > > > >> > > down > > > >> > >> > > > >> > > > >> to > > > >> > >> > > > >> > > > >> > be a reasonable size (I have the full thing > > if > > > >> > >> needed, but > > > >> > >> > > it > > > >> > >> > > > >> was > > > >> > >> > > > >> > > over a > > > >> > >> > > > >> > > > >> > gigabyte). > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 > > onwards. 
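
A rough sketch of the 2-minute detection logic described in the quoted exchange above (hypothetical names and structure; the real coaster channel code differs): every packet, including the one-minute heartbeats, refreshes a last-seen timestamp, and a periodic check declares the connection broken and fails its jobs once nothing has arrived for two minutes.

import java.util.concurrent.TimeUnit;

// Hypothetical watchdog; the actual coaster channel code is more involved.
class ChannelWatchdog {
    private static final long TIMEOUT_NANOS = TimeUnit.MINUTES.toNanos(2);

    private volatile long lastPacketNanos = System.nanoTime();
    private volatile boolean broken = false;

    // Called for every packet received on the connection, including the
    // heartbeats sent every minute, so a healthy channel never times out.
    void packetReceived() {
        lastPacketNanos = System.nanoTime();
    }

    // Called periodically; once nothing has been seen for 2 minutes the
    // connection is assumed broken and the jobs that were submitted to
    // its workers are treated as failed.
    void check() {
        if (!broken && System.nanoTime() - lastPacketNanos > TIMEOUT_NANOS) {
            broken = true;
            failOutstandingJobs("no traffic for 2 minutes, assuming connection is broken");
        }
    }

    private void failOutstandingJobs(String reason) {
        // Placeholder: the real service marks every job on the affected
        // workers as failed and logs the reason.
        System.err.println("Channel timed out: " + reason);
    }
}
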
> > > >> The > > > >> > >> client > > > >> > >> > > > >> submits 4 > > > >> > >> > > > >> > > > >> jobs > > > >> > >> > > > >> > > > >> > (its limit), but they don't complete until > > > >> 19:51:32 > > > >> > >> or so > > > >> > >> > > (I > > > >> > >> > > > >> can see > > > >> > >> > > > >> > > > >> that > > > >> > >> > > > >> > > > >> > one task completed based on ncompleted=1 in > > the > > > >> > >> > > check_tasks log > > > >> > >> > > > >> > > > >> message). > > > >> > >> > > > >> > > > >> > It looks like something has happened with > > broken > > > >> > >> pipes and > > > >> > >> > > > >> workers > > > >> > >> > > > >> > > being > > > >> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate > > cause of > > > >> > >> that is > > > >> > >> > > > >> likely to > > > >> > >> > > > >> > > be. > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > - Tim > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael > > Hategan < > > > >> > >> > > > >> hategan at mcs.anl.gov > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > >> wrote: > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > Hi Tim, > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > I've never seen this before with pure Java. > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > Mihael > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim > > > >> Armstrong > > > >> > >> wrote: > > > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script that > > submit > > > >> > >> tasks to > > > >> > >> > > > >> Coasters > > > >> > >> > > > >> > > > >> through > > > >> > >> > > > >> > > > >> > > the > > > >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd > > behaviour > > > >> > >> where task > > > >> > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 > > > >> minute > > > >> > >> periods. > > > >> > >> > > > >> For > > > >> > >> > > > >> > > > >> example, I'm > > > >> > >> > > > >> > > > >> > > > seeing submit log messages like > > "submitting > > > >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > > > >> > >> /bin/hostname" in > > > >> > >> > > > >> bursts of > > > >> > >> > > > >> > > > >> several > > > >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes > > in > > > >> > >> between, > > > >> > >> > > e.g. > > > >> > >> > > > >> I'm > > > >> > >> > > > >> > > seeing > > > >> > >> > > > >> > > > >> > > bursts > > > >> > >> > > > >> > > > >> > > > with the following intervals in my logs. > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is on the > > > >> coaster > > > >> > >> > > service > > > >> > >> > > > >> side: > > > >> > >> > > > >> > > the > > > >> > >> > > > >> > > > >> C > > > >> > >> > > > >> > > > >> > > > client is just waiting for a response. 
> > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted > > through the > > > >> > >> local job > > > >> > >> > > > >> > > manager, so > > > >> > >> > > > >> > > > >> I > > > >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. The > > tasks > > > >> are > > > >> > >> also > > > >> > >> > > just > > > >> > >> > > > >> > > > >> > > "/bin/hostname", > > > >> > >> > > > >> > > > >> > > > so should return immediately. > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into this > > on my > > > >> > >> own, but > > > >> > >> > > the > > > >> > >> > > > >> 2 > > > >> > >> > > > >> > > minute > > > >> > >> > > > >> > > > >> delay > > > >> > >> > > > >> > > > >> > > > seems like a big clue: does anyone have > > an > > > >> idea > > > >> > >> what > > > >> > >> > > could > > > >> > >> > > > >> cause > > > >> > >> > > > >> > > > >> stalls > > > >> > >> > > > >> > > > >> > > in > > > >> > >> > > > >> > > > >> > > > task submission of 2 minute duration? > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > Cheers, > > > >> > >> > > > >> > > > >> > > > Tim > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > > > > > >> > >> > > > > > >> > >> > > > > > >> > >> > > > > > >> > >> > > > >> > >> > > > >> > >> > > > >> > > > > > >> > > > >> > > > >> > > > > > > > > > > From tim.g.armstrong at gmail.com Thu Sep 11 13:54:21 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 13:54:21 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410461068.27191.7.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> Message-ID: Yeah, local workers, this was started with start-coaster-service with conf file: export WORKER_MODE=local export IPADDR=127.0.0.1 export SERVICE_PORT=53363 export JOBSPERNODE=4 export LOGDIR=$(pwd) - Tim On Thu, Sep 11, 2014 at 1:44 PM, Mihael Hategan wrote: > Passive workers? > > On Thu, 2014-09-11 at 13:26 -0500, Tim Armstrong wrote: > > Oops, forgot about that > > > > > > > > On Thu, Sep 11, 2014 at 1:23 PM, Mihael Hategan > wrote: > > > > > The coaster logging was broken, and that brokenness caused it to print > > > everything on stdout. That got fixed, so the actual log is now > > > in ./cps*.log. > > > > > > So I probably need that log. > > > > > > Mihael > > > > > > On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote: > > > > I meant the github master., but it turns out that I had had the wrong > > > Swift > > > > on my path. Apologies for the confusion. > > > > > > > > I've rerun with the current one. > > > > > > > > I'm getting a null pointer exception on line 226 of > > > > BlockQueueProcessor.java. Adding some printfs revealed that > settings was > > > > null. > > > > > > > > Log attached. 
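
For illustration only, a fail-fast guard of the kind the printf debugging quoted above points at (hypothetical names, not the actual BlockQueueProcessor code): it converts the bare NullPointerException into an error that names the piece of state that was never attached.

import java.util.Objects;

// Hypothetical fail-fast guard; not the actual BlockQueueProcessor code.
class QueueProcessorSketch {
    private Object settings; // stands in for the per-queue Settings object

    void checkJob(Object job) {
        // Replace the bare NullPointerException reported above with an
        // error that says which piece of state was missing.
        Objects.requireNonNull(settings,
                "no Settings attached to this queue processor before checkJob()");
        // ... per-job checks against the settings would follow here ...
    }
}
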
> > > > > > > > - Tim > > > > > > > > Job: Job(id:0 600.000s) > > > > Settings: null > > > > java.lang.NullPointerException > > > > at > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) > > > > at > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) > > > > at > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) > > > > at > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) > > > > at > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) > > > > at > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) > > > > at > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) > > > > at > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) > > > > at > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) > > > > > > > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong < > > > tim.g.armstrong at gmail.com> > > > > wrote: > > > > > > > > > I thought I was running the latest trunk, I'll rebuild and see if > I can > > > > > reproduce the issue. > > > > > > > > > > - Tim > > > > > > > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan < > hategan at mcs.anl.gov> > > > > > wrote: > > > > > > > > > >> The method "getMetaChannel()" has been removed. Where did you get > the > > > > >> code from? > > > > >> > > > > >> Mihael > > > > >> > > > > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > > > >> > I'm seeing failures when running Swift/T tests with > > > > >> > start-coaster-service.sh. > > > > >> > > > > > >> > E.g. the turbine test coaster-exec-1. I can provide > instructions > > > for > > > > >> > running the test if needed (roughly, you need to build Swift/T > with > > > > >> coaster > > > > >> > support enabled, then make tests/coaster-exec-1.result in the > > > turbine > > > > >> > directory). The github swift-t release is up to date if you > want > > > to use > > > > >> > that. > > > > >> > > > > > >> > Full log is attached, stack trace excerpt is below. > > > > >> > > > > > >> > - Tim > > > > >> > > > > > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor > Starting... 
> > > > >> > id=0911-1112130 > > > > >> > Using threaded sender for TCPChannel [type: server, contact: > > > > >> 127.0.0.1:48242 > > > > >> > ] > > > > >> > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel > > > Using > > > > >> > threaded sender for TCPChannel [type: server, contact: > > > 127.0.0.1:48242] > > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid > channel: > > > null > > > > >> > @id://null-nullS > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > >> > at > > > > >> > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > >> > provider=local > > > > >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler > > > provider=local > > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid > channel: > > > null > > > > >> > @id://null-nullS > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > >> > at > > > > >> > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not > deserialize > > > job > > > > >> > description > > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize > job > > > > >> > description > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > > >> > at > > > > >> > > > > > >> > > > > 
org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > >> > at > > > > >> > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: > Invalid > > > > >> > channel: null at id://null-nullS > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > >> > ... 4 more > > > > >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: > 38907, > > > > >> > SUBMITJOB) sending error: Could not deserialize job description > > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize > job > > > > >> > description > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > >> > at > > > > >> > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: > Invalid > > > > >> > channel: null at id://null-nullS > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > >> > ... 4 more > > > > >> > > > > > >> > > > > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > > > > >> tim.g.armstrong at gmail.com> > > > > >> > wrote: > > > > >> > > > > > >> > > This all sounds great. > > > > >> > > > > > > >> > > Just to check that I've understood correctly, from the > client's > > > point > > > > >> of > > > > >> > > view: > > > > >> > > * The per-client settings behave the same if -shared is not > > > provided. > > > > >> > > * Per-client settings are ignored if -shared is provided > > > > >> > > > > > > >> > > I had one question: > > > > >> > > * Do automatically allocated workers work with per-client > > > settings? 
I > > > > >> > > understand there were some issues related to sharing workers > > > between > > > > >> > > clients. Was the solution to have separate worker pools, or > is > > > this > > > > >> just > > > > >> > > not supported? > > > > >> > > > > > > >> > > - Tim > > > > >> > > > > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan < > > > hategan at mcs.anl.gov> > > > > >> > > wrote: > > > > >> > > > > > > >> > >> So... > > > > >> > >> > > > > >> > >> There were bugs. Lots of bugs. > > > > >> > >> I did some work over the weekend to fix some of these and > clean > > > up > > > > >> the > > > > >> > >> coaster code. Here's a summary: > > > > >> > >> > > > > >> > >> - there was some stuff in the low level coaster code to deal > with > > > > >> > >> persisting coaster channels over multiple connections with > > > various > > > > >> > >> options, like periodic connections, client or server > initiated > > > > >> > >> connections, buffering of commands, etc. None of this was > used by > > > > >> Swift, > > > > >> > >> and the code was pretty messy. I removed that. > > > > >> > >> - there were some issues with multiple clients: > > > > >> > >> * improper shutdown of relevant workers when a client > > > disconnected > > > > >> > >> * the worker task dispatcher was a singleton and had a > > > reference to > > > > >> > >> one block allocator, whereas multiple clients involved > multiple > > > > >> > >> allocators. > > > > >> > >> - there were a bunch of locking issues in the C client that > > > valgrind > > > > >> > >> caught > > > > >> > >> - the idea of remote job ids was a bit hard to work with. > This > > > > >> remote id > > > > >> > >> was the job id that the service assigned to a job. This is > > > necessary > > > > >> > >> because two different clients can submit jobs with the same > id. > > > The > > > > >> > >> remote id would be communicated to the client as the reply > to the > > > > >> submit > > > > >> > >> request. However, it was entirely possible for a notification > > > about > > > > >> job > > > > >> > >> status to be sent to the client before the submit reply was. > > > Since > > > > >> > >> notifications were sent using the remote-id, the client would > > > have no > > > > >> > >> idea what job the notifications belonged to. Now, the server > > > might > > > > >> need > > > > >> > >> a unique job id, but there is no reason why it cannot use the > > > client > > > > >> id > > > > >> > >> when communicating the status to a client. So that's there > now. > > > > >> > >> - the way the C client was working, its jobs ended up not > going > > > to > > > > >> the > > > > >> > >> workers, but the local queue. The service settings now allow > > > > >> specifying > > > > >> > >> the provider/jobManager/url to be used to start blocks, and > jobs > > > are > > > > >> > >> routed appropriately if they do not have the batch job flag > set. > > > > >> > >> > > > > >> > >> I also added a shared service mode. We discussed this before. > > > > >> Basically > > > > >> > >> you start the coaster service with "-shared > > > " and > > > > >> > >> all the settings are read from that file. In this case, all > > > clients > > > > >> > >> share the same worker pool, and client settings are ignored. > > > > >> > >> > > > > >> > >> The C client now has a multi-job testing tool which can > submit > > > many > > > > >> jobs > > > > >> > >> with the desired level of concurrency. 
> > > > >> > >> > > > > >> > >> I have tested the C client with both shared and non-shared > mode, > > > with > > > > >> > >> various levels of jobs being sent, with either one or two > > > concurrent > > > > >> > >> clients. > > > > >> > >> > > > > >> > >> I haven't tested manual workers. > > > > >> > >> > > > > >> > >> I've also decided that during normal operation (i.e. client > > > connects, > > > > >> > >> submits jobs, shuts down gracefully), there should be no > > > exceptions > > > > >> in > > > > >> > >> the coaster log. I think we should stick to that principle. > This > > > was > > > > >> the > > > > >> > >> case last I tested, and we should consider any deviation from > > > that > > > > >> to be > > > > >> > >> a problem. Of course, there are some things for which there > is no > > > > >> > >> graceful shut down, such as ctrl+C-ing a manual worker. > > > Exceptions > > > > >> are > > > > >> > >> fine in that case. > > > > >> > >> > > > > >> > >> So anyway, let's start from here. > > > > >> > >> > > > > >> > >> Mihael > > > > >> > >> > > > > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > > > >> > >> > Thanks, let me know if there's anything I can help do. > > > > >> > >> > > > > > >> > >> > - Tim > > > > >> > >> > > > > > >> > >> > > > > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > > > > >> hategan at mcs.anl.gov> > > > > >> > >> wrote: > > > > >> > >> > > > > > >> > >> > > Thanks. It also seems that there is an older bug in > there in > > > > >> which the > > > > >> > >> > > client connection is not properly accounted for and > things > > > start > > > > >> > >> failing > > > > >> > >> > > two minutes after the client connects (which is also > probably > > > > >> why you > > > > >> > >> > > didn't see this in runs with many short client > connections). > > > I'm > > > > >> not > > > > >> > >> > > sure why the fix for that bug isn't in the trunk code. > > > > >> > >> > > > > > > >> > >> > > In any event, I'll set up a client submission loop and > fix > > > all > > > > >> these > > > > >> > >> > > things. > > > > >> > >> > > > > > > >> > >> > > Mihael > > > > >> > >> > > > > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > > > >> > >> > > > Ok, here it is with the additional debug messages. > Source > > > code > > > > >> > >> change is > > > > >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > > >> > >> > > > > > > > >> > >> > > > Warning: the tarball will expand to several gigabytes > of > > > logs. > > > > >> > >> > > > > > > > >> > >> > > > I had to do multiple client runs to trigger it. It > seems > > > like > > > > >> the > > > > >> > >> > > problem > > > > >> > >> > > > might be triggered by abnormal termination of the > client. > > > > >> First 18 > > > > >> > >> runs > > > > >> > >> > > > went fine, problem only started when I ctrl-c-ed the > > > swift/t > > > > >> run #19 > > > > >> > >> > > before > > > > >> > >> > > > the run #20 that exhibited delays. 
> > > > >> > >> > > > > > > > >> > >> > > > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > > >> > >> > > > > > > > >> > >> > > > - Tim > > > > >> > >> > > > > > > > >> > >> > > > > > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > > > >> > >> tim.g.armstrong at gmail.com > > > > >> > >> > > > > > > > >> > >> > > > wrote: > > > > >> > >> > > > > > > > >> > >> > > > > It's here: > > > > >> > >> > > > > > > > > >> > >> > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > > > > >> . > > > > >> > >> > > > > > > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ > > > client > > > > >> and > > > > >> > >> see > > > > >> > >> > > if I > > > > >> > >> > > > > can recreate the scenario. > > > > >> > >> > > > > > > > > >> > >> > > > > - Tim > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > > > >> > >> hategan at mcs.anl.gov> > > > > >> > >> > > > > wrote: > > > > >> > >> > > > > > > > > >> > >> > > > >> Ok, so that's legit. > > > > >> > >> > > > >> > > > > >> > >> > > > >> It does look like shut down workers are not properly > > > > >> accounted > > > > >> > >> for in > > > > >> > >> > > > >> some places (and I believe Yadu submitted a bug for > > > this). > > > > >> > >> However, I > > > > >> > >> > > do > > > > >> > >> > > > >> not see the dead time you mention in either of the > last > > > two > > > > >> sets > > > > >> > >> of > > > > >> > >> > > > >> logs. It looks like each client instance submits a > > > continous > > > > >> > >> stream of > > > > >> > >> > > > >> jobs. > > > > >> > >> > > > >> > > > > >> > >> > > > >> So let's get back to the initial log. Can I have the > > > full > > > > >> > >> service log? > > > > >> > >> > > > >> I'm trying to track what happened with the jobs > > > submitted > > > > >> before > > > > >> > >> the > > > > >> > >> > > > >> first big pause. > > > > >> > >> > > > >> > > > > >> > >> > > > >> Also, a log message in > CoasterClient::updateJobStatus() > > > (or > > > > >> > >> friends) > > > > >> > >> > > > >> would probably help a lot here. > > > > >> > >> > > > >> > > > > >> > >> > > > >> Mihael > > > > >> > >> > > > >> > > > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong > wrote: > > > > >> > >> > > > >> > Should be here: > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > > > > >> > >> hategan at mcs.anl.gov > > > > >> > >> > > > > > > > >> > >> > > > >> wrote: > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. > The log > > > > >> looks > > > > >> > >> funny > > > > >> > >> > > at > > > > >> > >> > > > >> the > > > > >> > >> > > > >> > > end. > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > Can you git pull and re-run? The worker is > getting > > > some > > > > >> > >> command > > > > >> > >> > > at the > > > > >> > >> > > > >> > > end there and doing nothing about it and I > wonder > > > why. 
> > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > Mihael > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong > > > wrote: > > > > >> > >> > > > >> > > > Ok, now I have some worker logs: > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker > logs > > > that > > > > >> would > > > > >> > >> > > > >> indicate why > > > > >> > >> > > > >> > > > the connection was broken. > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > - Tim > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong > < > > > > >> > >> > > > >> tim.g.armstrong at gmail.com > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > wrote: > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > This is all running locally on my laptop, > so I > > > > >> think we > > > > >> > >> can > > > > >> > >> > > rule > > > > >> > >> > > > >> out > > > > >> > >> > > > >> > > 1). > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster > > > service > > > > >> gets > > > > >> > >> into > > > > >> > >> > > > >> after a > > > > >> > >> > > > >> > > few > > > > >> > >> > > > >> > > > > client sessions: generally the first > coaster run > > > > >> works > > > > >> > >> fine, > > > > >> > >> > > then > > > > >> > >> > > > >> > > after a > > > > >> > >> > > > >> > > > > few runs the problem occurs more frequently. > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, in the > > > > >> meantime > > > > >> > >> i've got > > > > >> > >> > > > >> some > > > > >> > >> > > > >> > > > > jstacks (attached). > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if > > > needed: > > > > >> > >> > > > >> > > > > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael > Hategan > > > < > > > > >> > >> > > > >> hategan at mcs.anl.gov> > > > > >> > >> > > > >> > > > > wrote: > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> Ah, makes sense. > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live > > > > >> connection > > > > >> > >> is > > > > >> > >> > > > >> guaranteed > > > > >> > >> > > > >> > > to > > > > >> > >> > > > >> > > > >> have some communication for any 2 minute > time > > > > >> window, > > > > >> > >> > > partially > > > > >> > >> > > > >> due to > > > > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). > If > > > no > > > > >> > >> packets flow > > > > >> > >> > > > >> for the > > > > >> > >> > > > >> > > > >> duration of 2 minutes, the connection is > > > assumed > > > > >> broken > > > > >> > >> and > > > > >> > >> > > all > > > > >> > >> > > > >> jobs > > > > >> > >> > > > >> > > > >> that were submitted to the respective > workers > > > are > > > > >> > >> considered > > > > >> > >> > > > >> failed. 
> > > > >> > >> > > > >> > > So > > > > >> > >> > > > >> > > > >> there seems to be an issue with the > > > connections to > > > > >> some > > > > >> > >> of > > > > >> > >> > > the > > > > >> > >> > > > >> > > workers, > > > > >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> Since the service seems to be alive > (although a > > > > >> jstack > > > > >> > >> on the > > > > >> > >> > > > >> service > > > > >> > >> > > > >> > > > >> when thing seem to hang might help), this > > > leaves > > > > >> two > > > > >> > >> > > > >> possibilities: > > > > >> > >> > > > >> > > > >> 1 - some genuine network problem > > > > >> > >> > > > >> > > > >> 2 - the worker died without properly > closing > > > TCP > > > > >> > >> connections > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> If (2), you could enable worker logging > > > > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = > > > "DEBUG") to > > > > >> see > > > > >> > >> if > > > > >> > >> > > > >> anything > > > > >> > >> > > > >> > > shows > > > > >> > >> > > > >> > > > >> up. > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> Mihael > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim > > > Armstrong > > > > >> wrote: > > > > >> > >> > > > >> > > > >> > Here are client and service logs, with > part > > > of > > > > >> > >> service log > > > > >> > >> > > > >> edited > > > > >> > >> > > > >> > > down > > > > >> > >> > > > >> > > > >> to > > > > >> > >> > > > >> > > > >> > be a reasonable size (I have the full > thing > > > if > > > > >> > >> needed, but > > > > >> > >> > > it > > > > >> > >> > > > >> was > > > > >> > >> > > > >> > > over a > > > > >> > >> > > > >> > > > >> > gigabyte). > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 > > > onwards. > > > > >> The > > > > >> > >> client > > > > >> > >> > > > >> submits 4 > > > > >> > >> > > > >> > > > >> jobs > > > > >> > >> > > > >> > > > >> > (its limit), but they don't complete > until > > > > >> 19:51:32 > > > > >> > >> or so > > > > >> > >> > > (I > > > > >> > >> > > > >> can see > > > > >> > >> > > > >> > > > >> that > > > > >> > >> > > > >> > > > >> > one task completed based on ncompleted=1 > in > > > the > > > > >> > >> > > check_tasks log > > > > >> > >> > > > >> > > > >> message). > > > > >> > >> > > > >> > > > >> > It looks like something has happened with > > > broken > > > > >> > >> pipes and > > > > >> > >> > > > >> workers > > > > >> > >> > > > >> > > being > > > > >> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate > > > cause of > > > > >> > >> that is > > > > >> > >> > > > >> likely to > > > > >> > >> > > > >> > > be. > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > - Tim > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael > > > Hategan < > > > > >> > >> > > > >> hategan at mcs.anl.gov > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> wrote: > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > Hi Tim, > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > I've never seen this before with pure > Java. > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? 
> > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > Mihael > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim > > > > >> Armstrong > > > > >> > >> wrote: > > > > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script > that > > > submit > > > > >> > >> tasks to > > > > >> > >> > > > >> Coasters > > > > >> > >> > > > >> > > > >> through > > > > >> > >> > > > >> > > > >> > > the > > > > >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd > > > behaviour > > > > >> > >> where task > > > > >> > >> > > > >> > > > >> > > > submission/execution is stalling for > ~2 > > > > >> minute > > > > >> > >> periods. > > > > >> > >> > > > >> For > > > > >> > >> > > > >> > > > >> example, I'm > > > > >> > >> > > > >> > > > >> > > > seeing submit log messages like > > > "submitting > > > > >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > > > > >> > >> /bin/hostname" in > > > > >> > >> > > > >> bursts of > > > > >> > >> > > > >> > > > >> several > > > > >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 > minutes > > > in > > > > >> > >> between, > > > > >> > >> > > e.g. > > > > >> > >> > > > >> I'm > > > > >> > >> > > > >> > > seeing > > > > >> > >> > > > >> > > > >> > > bursts > > > > >> > >> > > > >> > > > >> > > > with the following intervals in my > logs. > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is > on the > > > > >> coaster > > > > >> > >> > > service > > > > >> > >> > > > >> side: > > > > >> > >> > > > >> > > the > > > > >> > >> > > > >> > > > >> C > > > > >> > >> > > > >> > > > >> > > > client is just waiting for a > response. > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted > > > through the > > > > >> > >> local job > > > > >> > >> > > > >> > > manager, so > > > > >> > >> > > > >> > > > >> I > > > > >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. > The > > > tasks > > > > >> are > > > > >> > >> also > > > > >> > >> > > just > > > > >> > >> > > > >> > > > >> > > "/bin/hostname", > > > > >> > >> > > > >> > > > >> > > > so should return immediately. > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into > this > > > on my > > > > >> > >> own, but > > > > >> > >> > > the > > > > >> > >> > > > >> 2 > > > > >> > >> > > > >> > > minute > > > > >> > >> > > > >> > > > >> delay > > > > >> > >> > > > >> > > > >> > > > seems like a big clue: does anyone > have > > > an > > > > >> idea > > > > >> > >> what > > > > >> > >> > > could > > > > >> > >> > > > >> cause > > > > >> > >> > > > >> > > > >> stalls > > > > >> > >> > > > >> > > > >> > > in > > > > >> > >> > > > >> > > > >> > > > task submission of 2 minute duration? 
> > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > Cheers, > > > > >> > >> > > > >> > > > >> > > > Tim > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > >> > >> > > > > >> > >> > > > > >> > > > > > > >> > > > > >> > > > > >> > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Sep 11 13:58:28 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 11:58:28 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> Message-ID: <1410461908.31274.1.camel@echo> Can you try automatic workers? Passive workers is something I didn't quite get to yet. I believe that they should only be allowed in shared mode. Thoughts? Mihael On Thu, 2014-09-11 at 13:54 -0500, Tim Armstrong wrote: > Yeah, local workers, this was started with start-coaster-service with conf > file: > > export WORKER_MODE=local > export IPADDR=127.0.0.1 > export SERVICE_PORT=53363 > export JOBSPERNODE=4 > export LOGDIR=$(pwd) > > > - Tim > > On Thu, Sep 11, 2014 at 1:44 PM, Mihael Hategan wrote: > > > Passive workers? > > > > On Thu, 2014-09-11 at 13:26 -0500, Tim Armstrong wrote: > > > Oops, forgot about that > > > > > > > > > > > > On Thu, Sep 11, 2014 at 1:23 PM, Mihael Hategan > > wrote: > > > > > > > The coaster logging was broken, and that brokenness caused it to print > > > > everything on stdout. That got fixed, so the actual log is now > > > > in ./cps*.log. > > > > > > > > So I probably need that log. > > > > > > > > Mihael > > > > > > > > On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote: > > > > > I meant the github master., but it turns out that I had had the wrong > > > > Swift > > > > > on my path. Apologies for the confusion. > > > > > > > > > > I've rerun with the current one. > > > > > > > > > > I'm getting a null pointer exception on line 226 of > > > > > BlockQueueProcessor.java. Adding some printfs revealed that > > settings was > > > > > null. > > > > > > > > > > Log attached. 
> > > > > > > > > > - Tim > > > > > > > > > > Job: Job(id:0 600.000s) > > > > > Settings: null > > > > > java.lang.NullPointerException > > > > > at > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) > > > > > at > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) > > > > > at > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) > > > > > at > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) > > > > > at > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) > > > > > at > > > > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) > > > > > at > > > > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) > > > > > at > > > > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) > > > > > at > > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) > > > > > > > > > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong < > > > > tim.g.armstrong at gmail.com> > > > > > wrote: > > > > > > > > > > > I thought I was running the latest trunk, I'll rebuild and see if > > I can > > > > > > reproduce the issue. > > > > > > > > > > > > - Tim > > > > > > > > > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan < > > hategan at mcs.anl.gov> > > > > > > wrote: > > > > > > > > > > > >> The method "getMetaChannel()" has been removed. Where did you get > > the > > > > > >> code from? > > > > > >> > > > > > >> Mihael > > > > > >> > > > > > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > > > > >> > I'm seeing failures when running Swift/T tests with > > > > > >> > start-coaster-service.sh. > > > > > >> > > > > > > >> > E.g. the turbine test coaster-exec-1. I can provide > > instructions > > > > for > > > > > >> > running the test if needed (roughly, you need to build Swift/T > > with > > > > > >> coaster > > > > > >> > support enabled, then make tests/coaster-exec-1.result in the > > > > turbine > > > > > >> > directory). The github swift-t release is up to date if you > > want > > > > to use > > > > > >> > that. > > > > > >> > > > > > > >> > Full log is attached, stack trace excerpt is below. > > > > > >> > > > > > > >> > - Tim > > > > > >> > > > > > > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor > > Starting... 
> > > > > >> > id=0911-1112130 > > > > > >> > Using threaded sender for TCPChannel [type: server, contact: > > > > > >> 127.0.0.1:48242 > > > > > >> > ] > > > > > >> > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel > > > > Using > > > > > >> > threaded sender for TCPChannel [type: server, contact: > > > > 127.0.0.1:48242] > > > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid > > channel: > > > > null > > > > > >> > @id://null-nullS > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > >> > at > > > > > >> > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > >> > provider=local > > > > > >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler > > > > provider=local > > > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid > > channel: > > > > null > > > > > >> > @id://null-nullS > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > >> > at > > > > > >> > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not > > deserialize > > > > job > > > > > >> > description > > > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize > > job > > > > > >> > description > > > > > >> > at > > > > > >> > > > > > > 
>> > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > >> > at > > > > > >> > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: > > Invalid > > > > > >> > channel: null at id://null-nullS > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > > >> > ... 4 more > > > > > >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: > > 38907, > > > > > >> > SUBMITJOB) sending error: Could not deserialize job description > > > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize > > job > > > > > >> > description > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > >> > at > > > > > >> > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: > > Invalid > > > > > >> > channel: null at id://null-nullS > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > > >> > ... 4 more > > > > > >> > > > > > > >> > > > > > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > > > > > >> tim.g.armstrong at gmail.com> > > > > > >> > wrote: > > > > > >> > > > > > > >> > > This all sounds great. 
> > > > > >> > > > > > > > >> > > Just to check that I've understood correctly, from the > > client's > > > > point > > > > > >> of > > > > > >> > > view: > > > > > >> > > * The per-client settings behave the same if -shared is not > > > > provided. > > > > > >> > > * Per-client settings are ignored if -shared is provided > > > > > >> > > > > > > > >> > > I had one question: > > > > > >> > > * Do automatically allocated workers work with per-client > > > > settings? I > > > > > >> > > understand there were some issues related to sharing workers > > > > between > > > > > >> > > clients. Was the solution to have separate worker pools, or > > is > > > > this > > > > > >> just > > > > > >> > > not supported? > > > > > >> > > > > > > > >> > > - Tim > > > > > >> > > > > > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan < > > > > hategan at mcs.anl.gov> > > > > > >> > > wrote: > > > > > >> > > > > > > > >> > >> So... > > > > > >> > >> > > > > > >> > >> There were bugs. Lots of bugs. > > > > > >> > >> I did some work over the weekend to fix some of these and > > clean > > > > up > > > > > >> the > > > > > >> > >> coaster code. Here's a summary: > > > > > >> > >> > > > > > >> > >> - there was some stuff in the low level coaster code to deal > > with > > > > > >> > >> persisting coaster channels over multiple connections with > > > > various > > > > > >> > >> options, like periodic connections, client or server > > initiated > > > > > >> > >> connections, buffering of commands, etc. None of this was > > used by > > > > > >> Swift, > > > > > >> > >> and the code was pretty messy. I removed that. > > > > > >> > >> - there were some issues with multiple clients: > > > > > >> > >> * improper shutdown of relevant workers when a client > > > > disconnected > > > > > >> > >> * the worker task dispatcher was a singleton and had a > > > > reference to > > > > > >> > >> one block allocator, whereas multiple clients involved > > multiple > > > > > >> > >> allocators. > > > > > >> > >> - there were a bunch of locking issues in the C client that > > > > valgrind > > > > > >> > >> caught > > > > > >> > >> - the idea of remote job ids was a bit hard to work with. > > This > > > > > >> remote id > > > > > >> > >> was the job id that the service assigned to a job. This is > > > > necessary > > > > > >> > >> because two different clients can submit jobs with the same > > id. > > > > The > > > > > >> > >> remote id would be communicated to the client as the reply > > to the > > > > > >> submit > > > > > >> > >> request. However, it was entirely possible for a notification > > > > about > > > > > >> job > > > > > >> > >> status to be sent to the client before the submit reply was. > > > > Since > > > > > >> > >> notifications were sent using the remote-id, the client would > > > > have no > > > > > >> > >> idea what job the notifications belonged to. Now, the server > > > > might > > > > > >> need > > > > > >> > >> a unique job id, but there is no reason why it cannot use the > > > > client > > > > > >> id > > > > > >> > >> when communicating the status to a client. So that's there > > now. > > > > > >> > >> - the way the C client was working, its jobs ended up not > > going > > > > to > > > > > >> the > > > > > >> > >> workers, but the local queue. The service settings now allow > > > > > >> specifying > > > > > >> > >> the provider/jobManager/url to be used to start blocks, and > > jobs > > > > are > > > > > >> > >> routed appropriately if they do not have the batch job flag > > set. 
> > > > > >> > >> > > > > > >> > >> I also added a shared service mode. We discussed this before. > > > > > >> Basically > > > > > >> > >> you start the coaster service with "-shared > > > > " and > > > > > >> > >> all the settings are read from that file. In this case, all > > > > clients > > > > > >> > >> share the same worker pool, and client settings are ignored. > > > > > >> > >> > > > > > >> > >> The C client now has a multi-job testing tool which can > > submit > > > > many > > > > > >> jobs > > > > > >> > >> with the desired level of concurrency. > > > > > >> > >> > > > > > >> > >> I have tested the C client with both shared and non-shared > > mode, > > > > with > > > > > >> > >> various levels of jobs being sent, with either one or two > > > > concurrent > > > > > >> > >> clients. > > > > > >> > >> > > > > > >> > >> I haven't tested manual workers. > > > > > >> > >> > > > > > >> > >> I've also decided that during normal operation (i.e. client > > > > connects, > > > > > >> > >> submits jobs, shuts down gracefully), there should be no > > > > exceptions > > > > > >> in > > > > > >> > >> the coaster log. I think we should stick to that principle. > > This > > > > was > > > > > >> the > > > > > >> > >> case last I tested, and we should consider any deviation from > > > > that > > > > > >> to be > > > > > >> > >> a problem. Of course, there are some things for which there > > is no > > > > > >> > >> graceful shut down, such as ctrl+C-ing a manual worker. > > > > Exceptions > > > > > >> are > > > > > >> > >> fine in that case. > > > > > >> > >> > > > > > >> > >> So anyway, let's start from here. > > > > > >> > >> > > > > > >> > >> Mihael > > > > > >> > >> > > > > > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > > > > >> > >> > Thanks, let me know if there's anything I can help do. > > > > > >> > >> > > > > > > >> > >> > - Tim > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > > > > > >> hategan at mcs.anl.gov> > > > > > >> > >> wrote: > > > > > >> > >> > > > > > > >> > >> > > Thanks. It also seems that there is an older bug in > > there in > > > > > >> which the > > > > > >> > >> > > client connection is not properly accounted for and > > things > > > > start > > > > > >> > >> failing > > > > > >> > >> > > two minutes after the client connects (which is also > > probably > > > > > >> why you > > > > > >> > >> > > didn't see this in runs with many short client > > connections). > > > > I'm > > > > > >> not > > > > > >> > >> > > sure why the fix for that bug isn't in the trunk code. > > > > > >> > >> > > > > > > > >> > >> > > In any event, I'll set up a client submission loop and > > fix > > > > all > > > > > >> these > > > > > >> > >> > > things. > > > > > >> > >> > > > > > > > >> > >> > > Mihael > > > > > >> > >> > > > > > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > > > > >> > >> > > > Ok, here it is with the additional debug messages. > > Source > > > > code > > > > > >> > >> change is > > > > > >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > > > >> > >> > > > > > > > > >> > >> > > > Warning: the tarball will expand to several gigabytes > > of > > > > logs. > > > > > >> > >> > > > > > > > > >> > >> > > > I had to do multiple client runs to trigger it. It > > seems > > > > like > > > > > >> the > > > > > >> > >> > > problem > > > > > >> > >> > > > might be triggered by abnormal termination of the > > client. 
> > > > > >> First 18 > > > > > >> > >> runs > > > > > >> > >> > > > went fine, problem only started when I ctrl-c-ed the > > > > swift/t > > > > > >> run #19 > > > > > >> > >> > > before > > > > > >> > >> > > > the run #20 that exhibited delays. > > > > > >> > >> > > > > > > > > >> > >> > > > > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > > > >> > >> > > > > > > > > >> > >> > > > - Tim > > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > > > > >> > >> tim.g.armstrong at gmail.com > > > > > >> > >> > > > > > > > > >> > >> > > > wrote: > > > > > >> > >> > > > > > > > > >> > >> > > > > It's here: > > > > > >> > >> > > > > > > > > > >> > >> > > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > > > > > >> . > > > > > >> > >> > > > > > > > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ > > > > client > > > > > >> and > > > > > >> > >> see > > > > > >> > >> > > if I > > > > > >> > >> > > > > can recreate the scenario. > > > > > >> > >> > > > > > > > > > >> > >> > > > > - Tim > > > > > >> > >> > > > > > > > > > >> > >> > > > > > > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > > > > >> > >> hategan at mcs.anl.gov> > > > > > >> > >> > > > > wrote: > > > > > >> > >> > > > > > > > > > >> > >> > > > >> Ok, so that's legit. > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> It does look like shut down workers are not properly > > > > > >> accounted > > > > > >> > >> for in > > > > > >> > >> > > > >> some places (and I believe Yadu submitted a bug for > > > > this). > > > > > >> > >> However, I > > > > > >> > >> > > do > > > > > >> > >> > > > >> not see the dead time you mention in either of the > > last > > > > two > > > > > >> sets > > > > > >> > >> of > > > > > >> > >> > > > >> logs. It looks like each client instance submits a > > > > continous > > > > > >> > >> stream of > > > > > >> > >> > > > >> jobs. > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> So let's get back to the initial log. Can I have the > > > > full > > > > > >> > >> service log? > > > > > >> > >> > > > >> I'm trying to track what happened with the jobs > > > > submitted > > > > > >> before > > > > > >> > >> the > > > > > >> > >> > > > >> first big pause. > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> Also, a log message in > > CoasterClient::updateJobStatus() > > > > (or > > > > > >> > >> friends) > > > > > >> > >> > > > >> would probably help a lot here. > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> Mihael > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong > > wrote: > > > > > >> > >> > > > >> > Should be here: > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > > > > > >> > >> hategan at mcs.anl.gov > > > > > >> > >> > > > > > > > > >> > >> > > > >> wrote: > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. > > The log > > > > > >> looks > > > > > >> > >> funny > > > > > >> > >> > > at > > > > > >> > >> > > > >> the > > > > > >> > >> > > > >> > > end. > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > Can you git pull and re-run? 
The worker is > > getting > > > > some > > > > > >> > >> command > > > > > >> > >> > > at the > > > > > >> > >> > > > >> > > end there and doing nothing about it and I > > wonder > > > > why. > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > Mihael > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong > > > > wrote: > > > > > >> > >> > > > >> > > > Ok, now I have some worker logs: > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker > > logs > > > > that > > > > > >> would > > > > > >> > >> > > > >> indicate why > > > > > >> > >> > > > >> > > > the connection was broken. > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > - Tim > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong > > < > > > > > >> > >> > > > >> tim.g.armstrong at gmail.com > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > wrote: > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > This is all running locally on my laptop, > > so I > > > > > >> think we > > > > > >> > >> can > > > > > >> > >> > > rule > > > > > >> > >> > > > >> out > > > > > >> > >> > > > >> > > 1). > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster > > > > service > > > > > >> gets > > > > > >> > >> into > > > > > >> > >> > > > >> after a > > > > > >> > >> > > > >> > > few > > > > > >> > >> > > > >> > > > > client sessions: generally the first > > coaster run > > > > > >> works > > > > > >> > >> fine, > > > > > >> > >> > > then > > > > > >> > >> > > > >> > > after a > > > > > >> > >> > > > >> > > > > few runs the problem occurs more frequently. > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, in the > > > > > >> meantime > > > > > >> > >> i've got > > > > > >> > >> > > > >> some > > > > > >> > >> > > > >> > > > > jstacks (attached). > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if > > > > needed: > > > > > >> > >> > > > >> > > > > > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael > > Hategan > > > > < > > > > > >> > >> > > > >> hategan at mcs.anl.gov> > > > > > >> > >> > > > >> > > > > wrote: > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> Ah, makes sense. > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live > > > > > >> connection > > > > > >> > >> is > > > > > >> > >> > > > >> guaranteed > > > > > >> > >> > > > >> > > to > > > > > >> > >> > > > >> > > > >> have some communication for any 2 minute > > time > > > > > >> window, > > > > > >> > >> > > partially > > > > > >> > >> > > > >> due to > > > > > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). 
> > If > > > > no > > > > > >> > >> packets flow > > > > > >> > >> > > > >> for the > > > > > >> > >> > > > >> > > > >> duration of 2 minutes, the connection is > > > > assumed > > > > > >> broken > > > > > >> > >> and > > > > > >> > >> > > all > > > > > >> > >> > > > >> jobs > > > > > >> > >> > > > >> > > > >> that were submitted to the respective > > workers > > > > are > > > > > >> > >> considered > > > > > >> > >> > > > >> failed. > > > > > >> > >> > > > >> > > So > > > > > >> > >> > > > >> > > > >> there seems to be an issue with the > > > > connections to > > > > > >> some > > > > > >> > >> of > > > > > >> > >> > > the > > > > > >> > >> > > > >> > > workers, > > > > > >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> Since the service seems to be alive > > (although a > > > > > >> jstack > > > > > >> > >> on the > > > > > >> > >> > > > >> service > > > > > >> > >> > > > >> > > > >> when thing seem to hang might help), this > > > > leaves > > > > > >> two > > > > > >> > >> > > > >> possibilities: > > > > > >> > >> > > > >> > > > >> 1 - some genuine network problem > > > > > >> > >> > > > >> > > > >> 2 - the worker died without properly > > closing > > > > TCP > > > > > >> > >> connections > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> If (2), you could enable worker logging > > > > > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = > > > > "DEBUG") to > > > > > >> see > > > > > >> > >> if > > > > > >> > >> > > > >> anything > > > > > >> > >> > > > >> > > shows > > > > > >> > >> > > > >> > > > >> up. > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> Mihael > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim > > > > Armstrong > > > > > >> wrote: > > > > > >> > >> > > > >> > > > >> > Here are client and service logs, with > > part > > > > of > > > > > >> > >> service log > > > > > >> > >> > > > >> edited > > > > > >> > >> > > > >> > > down > > > > > >> > >> > > > >> > > > >> to > > > > > >> > >> > > > >> > > > >> > be a reasonable size (I have the full > > thing > > > > if > > > > > >> > >> needed, but > > > > > >> > >> > > it > > > > > >> > >> > > > >> was > > > > > >> > >> > > > >> > > over a > > > > > >> > >> > > > >> > > > >> > gigabyte). > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 > > > > onwards. > > > > > >> The > > > > > >> > >> client > > > > > >> > >> > > > >> submits 4 > > > > > >> > >> > > > >> > > > >> jobs > > > > > >> > >> > > > >> > > > >> > (its limit), but they don't complete > > until > > > > > >> 19:51:32 > > > > > >> > >> or so > > > > > >> > >> > > (I > > > > > >> > >> > > > >> can see > > > > > >> > >> > > > >> > > > >> that > > > > > >> > >> > > > >> > > > >> > one task completed based on ncompleted=1 > > in > > > > the > > > > > >> > >> > > check_tasks log > > > > > >> > >> > > > >> > > > >> message). > > > > > >> > >> > > > >> > > > >> > It looks like something has happened with > > > > broken > > > > > >> > >> pipes and > > > > > >> > >> > > > >> workers > > > > > >> > >> > > > >> > > being > > > > > >> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate > > > > cause of > > > > > >> > >> that is > > > > > >> > >> > > > >> likely to > > > > > >> > >> > > > >> > > be. 
> > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > - Tim > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael > > > > Hategan < > > > > > >> > >> > > > >> hategan at mcs.anl.gov > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> wrote: > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > Hi Tim, > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > I've never seen this before with pure > > Java. > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > Mihael > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim > > > > > >> Armstrong > > > > > >> > >> wrote: > > > > > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script > > that > > > > submit > > > > > >> > >> tasks to > > > > > >> > >> > > > >> Coasters > > > > > >> > >> > > > >> > > > >> through > > > > > >> > >> > > > >> > > > >> > > the > > > > > >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd > > > > behaviour > > > > > >> > >> where task > > > > > >> > >> > > > >> > > > >> > > > submission/execution is stalling for > > ~2 > > > > > >> minute > > > > > >> > >> periods. > > > > > >> > >> > > > >> For > > > > > >> > >> > > > >> > > > >> example, I'm > > > > > >> > >> > > > >> > > > >> > > > seeing submit log messages like > > > > "submitting > > > > > >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > > > > > >> > >> /bin/hostname" in > > > > > >> > >> > > > >> bursts of > > > > > >> > >> > > > >> > > > >> several > > > > > >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 > > minutes > > > > in > > > > > >> > >> between, > > > > > >> > >> > > e.g. > > > > > >> > >> > > > >> I'm > > > > > >> > >> > > > >> > > seeing > > > > > >> > >> > > > >> > > > >> > > bursts > > > > > >> > >> > > > >> > > > >> > > > with the following intervals in my > > logs. > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > > > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > > > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > > > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > > > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is > > on the > > > > > >> coaster > > > > > >> > >> > > service > > > > > >> > >> > > > >> side: > > > > > >> > >> > > > >> > > the > > > > > >> > >> > > > >> > > > >> C > > > > > >> > >> > > > >> > > > >> > > > client is just waiting for a > > response. > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted > > > > through the > > > > > >> > >> local job > > > > > >> > >> > > > >> > > manager, so > > > > > >> > >> > > > >> > > > >> I > > > > > >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. > > The > > > > tasks > > > > > >> are > > > > > >> > >> also > > > > > >> > >> > > just > > > > > >> > >> > > > >> > > > >> > > "/bin/hostname", > > > > > >> > >> > > > >> > > > >> > > > so should return immediately. 
> > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into > > this > > > > on my > > > > > >> > >> own, but > > > > > >> > >> > > the > > > > > >> > >> > > > >> 2 > > > > > >> > >> > > > >> > > minute > > > > > >> > >> > > > >> > > > >> delay > > > > > >> > >> > > > >> > > > >> > > > seems like a big clue: does anyone > > have > > > > an > > > > > >> idea > > > > > >> > >> what > > > > > >> > >> > > could > > > > > >> > >> > > > >> cause > > > > > >> > >> > > > >> > > > >> stalls > > > > > >> > >> > > > >> > > > >> > > in > > > > > >> > >> > > > >> > > > >> > > > task submission of 2 minute duration? > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > Cheers, > > > > > >> > >> > > > >> > > > >> > > > Tim > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > > > > > > > >> > >> > > > > > > > >> > >> > > > > > > > >> > >> > > > > > > > >> > >> > > > > > >> > >> > > > > > >> > >> > > > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > From tim.g.armstrong at gmail.com Thu Sep 11 15:30:58 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 15:30:58 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410461908.31274.1.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> Message-ID: I'll give automatic workers a try. If i That would make sense since the passive workers aren't per-client. - Tim On Thu, Sep 11, 2014 at 1:58 PM, Mihael Hategan wrote: > Can you try automatic workers? > > Passive workers is something I didn't quite get to yet. I believe that > they should only be allowed in shared mode. Thoughts? > > Mihael > > On Thu, 2014-09-11 at 13:54 -0500, Tim Armstrong wrote: > > Yeah, local workers, this was started with start-coaster-service with > conf > > file: > > > > export WORKER_MODE=local > > export IPADDR=127.0.0.1 > > export SERVICE_PORT=53363 > > export JOBSPERNODE=4 > > export LOGDIR=$(pwd) > > > > > > - Tim > > > > On Thu, Sep 11, 2014 at 1:44 PM, Mihael Hategan > wrote: > > > > > Passive workers? > > > > > > On Thu, 2014-09-11 at 13:26 -0500, Tim Armstrong wrote: > > > > Oops, forgot about that > > > > > > > > > > > > > > > > On Thu, Sep 11, 2014 at 1:23 PM, Mihael Hategan > > > > wrote: > > > > > > > > > The coaster logging was broken, and that brokenness caused it to > print > > > > > everything on stdout. That got fixed, so the actual log is now > > > > > in ./cps*.log. > > > > > > > > > > So I probably need that log. > > > > > > > > > > Mihael > > > > > > > > > > On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote: > > > > > > I meant the github master., but it turns out that I had had the > wrong > > > > > Swift > > > > > > on my path. Apologies for the confusion. 
> > > > > > > > > > > > I've rerun with the current one. > > > > > > > > > > > > I'm getting a null pointer exception on line 226 of > > > > > > BlockQueueProcessor.java. Adding some printfs revealed that > > > settings was > > > > > > null. > > > > > > > > > > > > Log attached. > > > > > > > > > > > > - Tim > > > > > > > > > > > > Job: Job(id:0 600.000s) > > > > > > Settings: null > > > > > > java.lang.NullPointerException > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) > > > > > > at > > > > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) > > > > > > > > > > > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong < > > > > > tim.g.armstrong at gmail.com> > > > > > > wrote: > > > > > > > > > > > > > I thought I was running the latest trunk, I'll rebuild and see > if > > > I can > > > > > > > reproduce the issue. > > > > > > > > > > > > > > - Tim > > > > > > > > > > > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan < > > > hategan at mcs.anl.gov> > > > > > > > wrote: > > > > > > > > > > > > > >> The method "getMetaChannel()" has been removed. Where did you > get > > > the > > > > > > >> code from? > > > > > > >> > > > > > > >> Mihael > > > > > > >> > > > > > > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > > > > > >> > I'm seeing failures when running Swift/T tests with > > > > > > >> > start-coaster-service.sh. > > > > > > >> > > > > > > > >> > E.g. the turbine test coaster-exec-1. I can provide > > > instructions > > > > > for > > > > > > >> > running the test if needed (roughly, you need to build > Swift/T > > > with > > > > > > >> coaster > > > > > > >> > support enabled, then make tests/coaster-exec-1.result in > the > > > > > turbine > > > > > > >> > directory). The github swift-t release is up to date if you > > > want > > > > > to use > > > > > > >> > that. > > > > > > >> > > > > > > > >> > Full log is attached, stack trace excerpt is below. > > > > > > >> > > > > > > > >> > - Tim > > > > > > >> > > > > > > > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor > > > Starting... 
> > > > > > >> > id=0911-1112130 > > > > > > >> > Using threaded sender for TCPChannel [type: server, contact: > > > > > > >> 127.0.0.1:48242 > > > > > > >> > ] > > > > > > >> > 2014-09-11 12:11:13,708-0500 INFO > AbstractStreamCoasterChannel > > > > > Using > > > > > > >> > threaded sender for TCPChannel [type: server, contact: > > > > > 127.0.0.1:48242] > > > > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid > > > channel: > > > > > null > > > > > > >> > @id://null-nullS > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > > >> > at > > > > > > >> > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > > >> > provider=local > > > > > > >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler > > > > > provider=local > > > > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid > > > channel: > > > > > null > > > > > > >> > @id://null-nullS > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > > >> > at > > > > > > >> > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > > >> > Handler(tag: 38907, SUBMITJOB) 
sending error: Could not > > > deserialize > > > > > job > > > > > > >> > description > > > > > > >> > org.globus.cog.coaster.ProtocolException: Could not > deserialize > > > job > > > > > > >> > description > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > > >> > at > > > > > > >> > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: > > > Invalid > > > > > > >> > channel: null at id://null-nullS > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > > > >> > ... 
4 more > > > > > > >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: > > > 38907, > > > > > > >> > SUBMITJOB) sending error: Could not deserialize job > description > > > > > > >> > org.globus.cog.coaster.ProtocolException: Could not > deserialize > > > job > > > > > > >> > description > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > > >> > at > > > > > > >> > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: > > > Invalid > > > > > > >> > channel: null at id://null-nullS > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > > > >> > ... 4 more > > > > > > >> > > > > > > > >> > > > > > > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > > > > > > >> tim.g.armstrong at gmail.com> > > > > > > >> > wrote: > > > > > > >> > > > > > > > >> > > This all sounds great. > > > > > > >> > > > > > > > > >> > > Just to check that I've understood correctly, from the > > > client's > > > > > point > > > > > > >> of > > > > > > >> > > view: > > > > > > >> > > * The per-client settings behave the same if -shared is > not > > > > > provided. > > > > > > >> > > * Per-client settings are ignored if -shared is provided > > > > > > >> > > > > > > > > >> > > I had one question: > > > > > > >> > > * Do automatically allocated workers work with per-client > > > > > settings? I > > > > > > >> > > understand there were some issues related to sharing > workers > > > > > between > > > > > > >> > > clients. Was the solution to have separate worker pools, > or > > > is > > > > > this > > > > > > >> just > > > > > > >> > > not supported? > > > > > > >> > > > > > > > > >> > > - Tim > > > > > > >> > > > > > > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan < > > > > > hategan at mcs.anl.gov> > > > > > > >> > > wrote: > > > > > > >> > > > > > > > > >> > >> So... > > > > > > >> > >> > > > > > > >> > >> There were bugs. Lots of bugs. > > > > > > >> > >> I did some work over the weekend to fix some of these and > > > clean > > > > > up > > > > > > >> the > > > > > > >> > >> coaster code. 
Here's a summary: > > > > > > >> > >> > > > > > > >> > >> - there was some stuff in the low level coaster code to > deal > > > with > > > > > > >> > >> persisting coaster channels over multiple connections > with > > > > > various > > > > > > >> > >> options, like periodic connections, client or server > > > initiated > > > > > > >> > >> connections, buffering of commands, etc. None of this was > > > used by > > > > > > >> Swift, > > > > > > >> > >> and the code was pretty messy. I removed that. > > > > > > >> > >> - there were some issues with multiple clients: > > > > > > >> > >> * improper shutdown of relevant workers when a client > > > > > disconnected > > > > > > >> > >> * the worker task dispatcher was a singleton and had a > > > > > reference to > > > > > > >> > >> one block allocator, whereas multiple clients involved > > > multiple > > > > > > >> > >> allocators. > > > > > > >> > >> - there were a bunch of locking issues in the C client > that > > > > > valgrind > > > > > > >> > >> caught > > > > > > >> > >> - the idea of remote job ids was a bit hard to work with. > > > This > > > > > > >> remote id > > > > > > >> > >> was the job id that the service assigned to a job. This > is > > > > > necessary > > > > > > >> > >> because two different clients can submit jobs with the > same > > > id. > > > > > The > > > > > > >> > >> remote id would be communicated to the client as the > reply > > > to the > > > > > > >> submit > > > > > > >> > >> request. However, it was entirely possible for a > notification > > > > > about > > > > > > >> job > > > > > > >> > >> status to be sent to the client before the submit reply > was. > > > > > Since > > > > > > >> > >> notifications were sent using the remote-id, the client > would > > > > > have no > > > > > > >> > >> idea what job the notifications belonged to. Now, the > server > > > > > might > > > > > > >> need > > > > > > >> > >> a unique job id, but there is no reason why it cannot > use the > > > > > client > > > > > > >> id > > > > > > >> > >> when communicating the status to a client. So that's > there > > > now. > > > > > > >> > >> - the way the C client was working, its jobs ended up not > > > going > > > > > to > > > > > > >> the > > > > > > >> > >> workers, but the local queue. The service settings now > allow > > > > > > >> specifying > > > > > > >> > >> the provider/jobManager/url to be used to start blocks, > and > > > jobs > > > > > are > > > > > > >> > >> routed appropriately if they do not have the batch job > flag > > > set. > > > > > > >> > >> > > > > > > >> > >> I also added a shared service mode. We discussed this > before. > > > > > > >> Basically > > > > > > >> > >> you start the coaster service with "-shared > > > > > " and > > > > > > >> > >> all the settings are read from that file. In this case, > all > > > > > clients > > > > > > >> > >> share the same worker pool, and client settings are > ignored. > > > > > > >> > >> > > > > > > >> > >> The C client now has a multi-job testing tool which can > > > submit > > > > > many > > > > > > >> jobs > > > > > > >> > >> with the desired level of concurrency. > > > > > > >> > >> > > > > > > >> > >> I have tested the C client with both shared and > non-shared > > > mode, > > > > > with > > > > > > >> > >> various levels of jobs being sent, with either one or two > > > > > concurrent > > > > > > >> > >> clients. > > > > > > >> > >> > > > > > > >> > >> I haven't tested manual workers. > > > > > > >> > >> > > > > > > >> > >> I've also decided that during normal operation (i.e. 
> client > > > > > connects, > > > > > > >> > >> submits jobs, shuts down gracefully), there should be no > > > > > exceptions > > > > > > >> in > > > > > > >> > >> the coaster log. I think we should stick to that > principle. > > > This > > > > > was > > > > > > >> the > > > > > > >> > >> case last I tested, and we should consider any deviation > from > > > > > that > > > > > > >> to be > > > > > > >> > >> a problem. Of course, there are some things for which > there > > > is no > > > > > > >> > >> graceful shut down, such as ctrl+C-ing a manual worker. > > > > > Exceptions > > > > > > >> are > > > > > > >> > >> fine in that case. > > > > > > >> > >> > > > > > > >> > >> So anyway, let's start from here. > > > > > > >> > >> > > > > > > >> > >> Mihael > > > > > > >> > >> > > > > > > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > > > > > >> > >> > Thanks, let me know if there's anything I can help do. > > > > > > >> > >> > > > > > > > >> > >> > - Tim > > > > > > >> > >> > > > > > > > >> > >> > > > > > > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > > > > > > >> hategan at mcs.anl.gov> > > > > > > >> > >> wrote: > > > > > > >> > >> > > > > > > > >> > >> > > Thanks. It also seems that there is an older bug in > > > there in > > > > > > >> which the > > > > > > >> > >> > > client connection is not properly accounted for and > > > things > > > > > start > > > > > > >> > >> failing > > > > > > >> > >> > > two minutes after the client connects (which is also > > > probably > > > > > > >> why you > > > > > > >> > >> > > didn't see this in runs with many short client > > > connections). > > > > > I'm > > > > > > >> not > > > > > > >> > >> > > sure why the fix for that bug isn't in the trunk > code. > > > > > > >> > >> > > > > > > > > >> > >> > > In any event, I'll set up a client submission loop > and > > > fix > > > > > all > > > > > > >> these > > > > > > >> > >> > > things. > > > > > > >> > >> > > > > > > > > >> > >> > > Mihael > > > > > > >> > >> > > > > > > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong > wrote: > > > > > > >> > >> > > > Ok, here it is with the additional debug messages. > > > Source > > > > > code > > > > > > >> > >> change is > > > > > > >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > > > > >> > >> > > > > > > > > > >> > >> > > > Warning: the tarball will expand to several > gigabytes > > > of > > > > > logs. > > > > > > >> > >> > > > > > > > > > >> > >> > > > I had to do multiple client runs to trigger it. It > > > seems > > > > > like > > > > > > >> the > > > > > > >> > >> > > problem > > > > > > >> > >> > > > might be triggered by abnormal termination of the > > > client. > > > > > > >> First 18 > > > > > > >> > >> runs > > > > > > >> > >> > > > went fine, problem only started when I ctrl-c-ed > the > > > > > swift/t > > > > > > >> run #19 > > > > > > >> > >> > > before > > > > > > >> > >> > > > the run #20 that exhibited delays. 
> > > > > > >> > >> > > > > > > > > > >> > >> > > > > > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > > > > >> > >> > > > > > > > > > >> > >> > > > - Tim > > > > > > >> > >> > > > > > > > > > >> > >> > > > > > > > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > > > > > >> > >> tim.g.armstrong at gmail.com > > > > > > >> > >> > > > > > > > > > >> > >> > > > wrote: > > > > > > >> > >> > > > > > > > > > >> > >> > > > > It's here: > > > > > > >> > >> > > > > > > > > > > >> > >> > > > > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > > > > > > >> . > > > > > > >> > >> > > > > > > > > > > >> > >> > > > > I'll add some extra debug messages in the > coaster C++ > > > > > client > > > > > > >> and > > > > > > >> > >> see > > > > > > >> > >> > > if I > > > > > > >> > >> > > > > can recreate the scenario. > > > > > > >> > >> > > > > > > > > > > >> > >> > > > > - Tim > > > > > > >> > >> > > > > > > > > > > >> > >> > > > > > > > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > > > > > >> > >> hategan at mcs.anl.gov> > > > > > > >> > >> > > > > wrote: > > > > > > >> > >> > > > > > > > > > > >> > >> > > > >> Ok, so that's legit. > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> It does look like shut down workers are not > properly > > > > > > >> accounted > > > > > > >> > >> for in > > > > > > >> > >> > > > >> some places (and I believe Yadu submitted a bug > for > > > > > this). > > > > > > >> > >> However, I > > > > > > >> > >> > > do > > > > > > >> > >> > > > >> not see the dead time you mention in either of > the > > > last > > > > > two > > > > > > >> sets > > > > > > >> > >> of > > > > > > >> > >> > > > >> logs. It looks like each client instance > submits a > > > > > continous > > > > > > >> > >> stream of > > > > > > >> > >> > > > >> jobs. > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> So let's get back to the initial log. Can I > have the > > > > > full > > > > > > >> > >> service log? > > > > > > >> > >> > > > >> I'm trying to track what happened with the jobs > > > > > submitted > > > > > > >> before > > > > > > >> > >> the > > > > > > >> > >> > > > >> first big pause. > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> Also, a log message in > > > CoasterClient::updateJobStatus() > > > > > (or > > > > > > >> > >> friends) > > > > > > >> > >> > > > >> would probably help a lot here. > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> Mihael > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong > > > wrote: > > > > > > >> > >> > > > >> > Should be here: > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael > Hategan < > > > > > > >> > >> hategan at mcs.anl.gov > > > > > > >> > >> > > > > > > > > > >> > >> > > > >> wrote: > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. > > > The log > > > > > > >> looks > > > > > > >> > >> funny > > > > > > >> > >> > > at > > > > > > >> > >> > > > >> the > > > > > > >> > >> > > > >> > > end. > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > Can you git pull and re-run? 
The worker is > > > getting > > > > > some > > > > > > >> > >> command > > > > > > >> > >> > > at the > > > > > > >> > >> > > > >> > > end there and doing nothing about it and I > > > wonder > > > > > why. > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > Mihael > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim > Armstrong > > > > > wrote: > > > > > > >> > >> > > > >> > > > Ok, now I have some worker logs: > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > There's nothing obvious I see in the > worker > > > logs > > > > > that > > > > > > >> would > > > > > > >> > >> > > > >> indicate why > > > > > > >> > >> > > > >> > > > the connection was broken. > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > - Tim > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim > Armstrong > > > < > > > > > > >> > >> > > > >> tim.g.armstrong at gmail.com > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > wrote: > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > This is all running locally on my > laptop, > > > so I > > > > > > >> think we > > > > > > >> > >> can > > > > > > >> > >> > > rule > > > > > > >> > >> > > > >> out > > > > > > >> > >> > > > >> > > 1). > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > > It also seems like it's a state the > coaster > > > > > service > > > > > > >> gets > > > > > > >> > >> into > > > > > > >> > >> > > > >> after a > > > > > > >> > >> > > > >> > > few > > > > > > >> > >> > > > >> > > > > client sessions: generally the first > > > coaster run > > > > > > >> works > > > > > > >> > >> fine, > > > > > > >> > >> > > then > > > > > > >> > >> > > > >> > > after a > > > > > > >> > >> > > > >> > > > > few runs the problem occurs more > frequently. > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, > in the > > > > > > >> meantime > > > > > > >> > >> i've got > > > > > > >> > >> > > > >> some > > > > > > >> > >> > > > >> > > > > jstacks (attached). > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > > Matching service logs (largish) are > here if > > > > > needed: > > > > > > >> > >> > > > >> > > > > > > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael > > > Hategan > > > > > < > > > > > > >> > >> > > > >> hategan at mcs.anl.gov> > > > > > > >> > >> > > > >> > > > > wrote: > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > >> Ah, makes sense. > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each > live > > > > > > >> connection > > > > > > >> > >> is > > > > > > >> > >> > > > >> guaranteed > > > > > > >> > >> > > > >> > > to > > > > > > >> > >> > > > >> > > > >> have some communication for any 2 > minute > > > time > > > > > > >> window, > > > > > > >> > >> > > partially > > > > > > >> > >> > > > >> due to > > > > > > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 > minute). 
> > > If > > > > > no > > > > > > >> > >> packets flow > > > > > > >> > >> > > > >> for the > > > > > > >> > >> > > > >> > > > >> duration of 2 minutes, the connection > is > > > > > assumed > > > > > > >> broken > > > > > > >> > >> and > > > > > > >> > >> > > all > > > > > > >> > >> > > > >> jobs > > > > > > >> > >> > > > >> > > > >> that were submitted to the respective > > > workers > > > > > are > > > > > > >> > >> considered > > > > > > >> > >> > > > >> failed. > > > > > > >> > >> > > > >> > > So > > > > > > >> > >> > > > >> > > > >> there seems to be an issue with the > > > > > connections to > > > > > > >> some > > > > > > >> > >> of > > > > > > >> > >> > > the > > > > > > >> > >> > > > >> > > workers, > > > > > > >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> Since the service seems to be alive > > > (although a > > > > > > >> jstack > > > > > > >> > >> on the > > > > > > >> > >> > > > >> service > > > > > > >> > >> > > > >> > > > >> when thing seem to hang might help), > this > > > > > leaves > > > > > > >> two > > > > > > >> > >> > > > >> possibilities: > > > > > > >> > >> > > > >> > > > >> 1 - some genuine network problem > > > > > > >> > >> > > > >> > > > >> 2 - the worker died without properly > > > closing > > > > > TCP > > > > > > >> > >> connections > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> If (2), you could enable worker logging > > > > > > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = > > > > > "DEBUG") to > > > > > > >> see > > > > > > >> > >> if > > > > > > >> > >> > > > >> anything > > > > > > >> > >> > > > >> > > shows > > > > > > >> > >> > > > >> > > > >> up. > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> Mihael > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim > > > > > Armstrong > > > > > > >> wrote: > > > > > > >> > >> > > > >> > > > >> > Here are client and service logs, > with > > > part > > > > > of > > > > > > >> > >> service log > > > > > > >> > >> > > > >> edited > > > > > > >> > >> > > > >> > > down > > > > > > >> > >> > > > >> > > > >> to > > > > > > >> > >> > > > >> > > > >> > be a reasonable size (I have the full > > > thing > > > > > if > > > > > > >> > >> needed, but > > > > > > >> > >> > > it > > > > > > >> > >> > > > >> was > > > > > > >> > >> > > > >> > > over a > > > > > > >> > >> > > > >> > > > >> > gigabyte). > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 > > > > > onwards. > > > > > > >> The > > > > > > >> > >> client > > > > > > >> > >> > > > >> submits 4 > > > > > > >> > >> > > > >> > > > >> jobs > > > > > > >> > >> > > > >> > > > >> > (its limit), but they don't complete > > > until > > > > > > >> 19:51:32 > > > > > > >> > >> or so > > > > > > >> > >> > > (I > > > > > > >> > >> > > > >> can see > > > > > > >> > >> > > > >> > > > >> that > > > > > > >> > >> > > > >> > > > >> > one task completed based on > ncompleted=1 > > > in > > > > > the > > > > > > >> > >> > > check_tasks log > > > > > > >> > >> > > > >> > > > >> message). 
> > > > > > >> > >> > > > >> > > > >> > It looks like something has happened > with > > > > > broken > > > > > > >> > >> pipes and > > > > > > >> > >> > > > >> workers > > > > > > >> > >> > > > >> > > being > > > > > > >> > >> > > > >> > > > >> > lost, but I'm not sure what the > ultimate > > > > > cause of > > > > > > >> > >> that is > > > > > > >> > >> > > > >> likely to > > > > > > >> > >> > > > >> > > be. > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > - Tim > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, > Mihael > > > > > Hategan < > > > > > > >> > >> > > > >> hategan at mcs.anl.gov > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> wrote: > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > Hi Tim, > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > I've never seen this before with > pure > > > Java. > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > Mihael > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, > Tim > > > > > > >> Armstrong > > > > > > >> > >> wrote: > > > > > > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script > > > that > > > > > submit > > > > > > >> > >> tasks to > > > > > > >> > >> > > > >> Coasters > > > > > > >> > >> > > > >> > > > >> through > > > > > > >> > >> > > > >> > > > >> > > the > > > > > > >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some > odd > > > > > behaviour > > > > > > >> > >> where task > > > > > > >> > >> > > > >> > > > >> > > > submission/execution is stalling > for > > > ~2 > > > > > > >> minute > > > > > > >> > >> periods. > > > > > > >> > >> > > > >> For > > > > > > >> > >> > > > >> > > > >> example, I'm > > > > > > >> > >> > > > >> > > > >> > > > seeing submit log messages like > > > > > "submitting > > > > > > >> > >> > > > >> > > > >> > > > > urn:133-1409778135377-1409778135378: > > > > > > >> > >> /bin/hostname" in > > > > > > >> > >> > > > >> bursts of > > > > > > >> > >> > > > >> > > > >> several > > > > > > >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 > > > minutes > > > > > in > > > > > > >> > >> between, > > > > > > >> > >> > > e.g. > > > > > > >> > >> > > > >> I'm > > > > > > >> > >> > > > >> > > seeing > > > > > > >> > >> > > > >> > > > >> > > bursts > > > > > > >> > >> > > > >> > > > >> > > > with the following intervals in > my > > > logs. 
> > > > > > >> > >> > > > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > > > > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > > > > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > > > > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > > > > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > > > > >> > >> > > > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay > is > > > on the > > > > > > >> coaster > > > > > > >> > >> > > service > > > > > > >> > >> > > > >> side: > > > > > > >> > >> > > > >> > > the > > > > > > >> > >> > > > >> > > > >> C > > > > > > >> > >> > > > >> > > > >> > > > client is just waiting for a > > > response. > > > > > > >> > >> > > > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted > > > > > through the > > > > > > >> > >> local job > > > > > > >> > >> > > > >> > > manager, so > > > > > > >> > >> > > > >> > > > >> I > > > > > > >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. > > > The > > > > > tasks > > > > > > >> are > > > > > > >> > >> also > > > > > > >> > >> > > just > > > > > > >> > >> > > > >> > > > >> > > "/bin/hostname", > > > > > > >> > >> > > > >> > > > >> > > > so should return immediately. > > > > > > >> > >> > > > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging > into > > > this > > > > > on my > > > > > > >> > >> own, but > > > > > > >> > >> > > the > > > > > > >> > >> > > > >> 2 > > > > > > >> > >> > > > >> > > minute > > > > > > >> > >> > > > >> > > > >> delay > > > > > > >> > >> > > > >> > > > >> > > > seems like a big clue: does > anyone > > > have > > > > > an > > > > > > >> idea > > > > > > >> > >> what > > > > > > >> > >> > > could > > > > > > >> > >> > > > >> cause > > > > > > >> > >> > > > >> > > > >> stalls > > > > > > >> > >> > > > >> > > > >> > > in > > > > > > >> > >> > > > >> > > > >> > > > task submission of 2 minute > duration? > > > > > > >> > >> > > > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> > > > Cheers, > > > > > > >> > >> > > > >> > > > >> > > > Tim > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Thu Sep 11 15:34:48 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 13:34:48 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> Message-ID: <1410467688.638.0.camel@echo> On Thu, 2014-09-11 at 15:30 -0500, Tim Armstrong wrote: > I'll give automatic workers a try. If i > That would make sense since the passive workers aren't per-client. Right. That was my thought, too. I'll get this fixed soon-ish. Mihael From tim.g.armstrong at gmail.com Fri Sep 12 17:14:03 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Fri, 12 Sep 2014 17:14:03 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410467688.638.0.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> Message-ID: My initial test with active workers seems to be running fine. The passive worker test is failing - I'll rerun that once it's fixed. - Tim On Thu, Sep 11, 2014 at 3:34 PM, Mihael Hategan wrote: > On Thu, 2014-09-11 at 15:30 -0500, Tim Armstrong wrote: > > I'll give automatic workers a try. If i > > That would make sense since the passive workers aren't per-client. > > Right. That was my thought, too. I'll get this fixed soon-ish. > > Mihael > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Sat Sep 13 19:49:39 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 13 Sep 2014 17:49:39 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> Message-ID: <1410655779.22697.1.camel@echo> I didn't send an email about it, but I fixed it thursday evening. Or at least I hope I did. Mihael On Fri, 2014-09-12 at 17:14 -0500, Tim Armstrong wrote: > My initial test with active workers seems to be running fine. The passive > worker test is failing - I'll rerun that once it's fixed. > > - Tim > > On Thu, Sep 11, 2014 at 3:34 PM, Mihael Hategan wrote: > > > On Thu, 2014-09-11 at 15:30 -0500, Tim Armstrong wrote: > > > I'll give automatic workers a try. If i > > > That would make sense since the passive workers aren't per-client. > > > > Right. That was my thought, too. I'll get this fixed soon-ish. 
> > > > Mihael > > > > > > From tim.g.armstrong at gmail.com Mon Sep 15 12:52:45 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Mon, 15 Sep 2014 12:52:45 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410655779.22697.1.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> Message-ID: So I was trying to get it to run local jobs via a passive service. The jobs just seem to be accumulating in the service's queue and not being run. Maybe I'm using the wrong job manager - it's being left as NULL, which is converted to fork. - TIm On Sat, Sep 13, 2014 at 7:49 PM, Mihael Hategan wrote: > I didn't send an email about it, but I fixed it thursday evening. Or at > least I hope I did. > > Mihael > > On Fri, 2014-09-12 at 17:14 -0500, Tim Armstrong wrote: > > My initial test with active workers seems to be running fine. The > passive > > worker test is failing - I'll rerun that once it's fixed. > > > > - Tim > > > > On Thu, Sep 11, 2014 at 3:34 PM, Mihael Hategan > wrote: > > > > > On Thu, 2014-09-11 at 15:30 -0500, Tim Armstrong wrote: > > > > I'll give automatic workers a try. If i > > > > That would make sense since the passive workers aren't per-client. > > > > > > Right. That was my thought, too. I'll get this fixed soon-ish. > > > > > > Mihael > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: passive-no-run.tar.gz Type: application/x-gzip Size: 82317 bytes Desc: not available URL: From iraicu at cs.iit.edu Mon Sep 15 16:12:05 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Mon, 15 Sep 2014 16:12:05 -0500 Subject: [Swift-devel] CFP: IEEE/ACM Int. Symposium on Big Data Computing (BDC) 2014 -- 1 week deadline extension Message-ID: <54175625.7080008@cs.iit.edu> Call for Papers IEEE/ACM International Symposium on Big Data Computing (BDC) 2014 December 8-11, 2014, London, UK http://www.cloudbus.org/bdc2014 In conjunction with: 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2014) Sponsored by: IEEE Computer Society and ACM (Association for Computing Machinery) Introduction =============================================================================== Rapid advances in digital sensors, networks, storage, and computation along with their availability at low cost is leading to the creation of huge collections of data -- dubbed as Big Data. This data has the potential for enabling new insights that can change the way business, science, and governments deliver services to their consumers and can impact society as a whole. This has led to the emergence of the Big Data Computing paradigm focusing on sensing, collection, storage, management and analysis of data from variety of sources to enable new value and insights. To realize the full potential of Big Data Computing, we need to address several challenges and develop suitable conceptual and technological solutions for dealing them. 
These include life-cycle management of data, large-scale storage, flexible processing infrastructure, data modelling, scalable machine learning and data analysis algorithms, techniques for sampling and making trade-off between data processing time and accuracy, and dealing with privacy and ethical issues involved in data sensing, storage, processing, and actions. The IEEE/ACM International Symposium on Big Data Computing (BDC) 2014 -- held in conjunction with 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2014), December 8-11, 2014, London, UK, aims at bringing together international researchers, developers, policy makers, and users and to provide an international forum to present leading research activities, technical solutions, and results on a broad range of topics related to Big Data Computing paradigms, platforms and their applications. The conference features keynotes, technical presentations, posters, workshops, tutorials, as well as competitions featuring live demonstrations. Topics =============================================================================== Topics of interest include, but are not limited to: I. Big Data Science * Analytics * Algorithms for Big Data * Energy-efficient Algorithms * Big Data Search * Big Data Acquisition, Integration, Cleaning, and Best Practices * Visualization of Big Data II. Big Data Infrastructures and Platforms * Programming Systems * Cyber-Infrastructure * Performance evaluation * Fault tolerance and reliability * I/O and Data management * Storage Systems (including file systems, NoSQL, and RDBMS) * Resource management * Many-Task Computing * Many-core computing and accelerators III. Big Data Security and Policy * Management Policies * Data Privacy * Data Security * Big Data Archival and Preservation * Big Data Provenance IV. Big Data Applications * Scientific application cases studies on Cloud infrastructure * Big Data Applications at Scale * Experience Papers with Big Data Application Deployments * Data streaming applications * Big Data in Social Networks * Healthcare Applications * Enterprise Applications IMPORTANT DATES =============================================================================== * Abstracts Due: September 15th, 2014 * Papers Due: September 22nd, 2014 * Notification of Acceptance: October 15th, 2014 * Camera Ready Papers Due: October 31st, 2014 Note: Those who submit an abstract by the deadline will be given 1 week to upload the final paper. PAPER SUBMISSION =============================================================================== Authors are invited to submit papers electronically. Submitted manuscripts should be structured as technical papers and may not exceed 10 letter size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings (print area of 6-1/2 inches (16.51 cm) wide by 8-7/8 inches (22.51 cm) high, two-column format with columns 3-1/16 inches (7.85 cm) wide with a 3/8 inch (0.81 cm) space between them, single-spaced 10-point Times fully justified text). Submissions not conforming to these guidelines may be returned without review. Authors should submit the manuscript in PDF format and make sure that the file will print on a printer that uses letter size (8.5 x 11) paper. The official language of the meeting is English. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. 
Papers conforming to the above guidelines can be submitted through the BDC 2014 paper submission system (https://www.easychair.org/conferences/?conf=bdc2014). Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding length limit, or not appropriately structured may also not be considered. Authors may contact the conference PC Chair for more information. The proceedings will be published through the IEEE Computer Society Press, USA and will be made online through the IEEE Digital Library. Selected papers from BDC 2014 will be invited to extend and submit to the Special Issue on Many-Task Computing in the Cloud in the IEEE Transaction on Cloud Computing (http://datasys.cs.iit.edu/events/TCC-MTC15/CFP_TCC-MTC15.pdf) CHAIRS & COMMITTEES =============================================================================== General Co-Chairs: * Rajkumar Buyya, University of Melbourne, Australia * Divyakant Agrawal, University of California at Santa Barbara, USA Program Co-Chairs: * Ioan Raicu, Illinois Institute of Technology and Argonne National Lab., USA * Manish Parashar, Rutgers, The State University of New Jersey, USA Area Track Co-Chairs: * Big Data Science o Omer F. Rana, Cardiff University, UK o Ilkay Altintas, University of California, San Diego, USA * Big Data Infrastructures and Platforms o Amy Apon, Clemson University, USA o Jiannong Cao, Honk Kong Polytechnic University * Big Data Security and Policy o Bogdan Carbunar, Florida International University * Big Data Applications o Dennis Gannon, Microsoft Research, USA Cyber Chair * Amir Vahid, University of Melbourne, Australia Publicity Chairs * Carlos Westphall, Federal University of Santa Catarina, Brazil * Ching-Hsien Hsu, Chung Hua Univ., Taiwan & Tianjin Univ. of Technology, China * Rong Ge, Marquette University, USA * Giuliano Casale, Imperial College London, UK Organizing Chair: * Ashiq Anjum, University of Derby, UK -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= From hategan at mcs.anl.gov Tue Sep 16 03:50:46 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Sep 2014 01:50:46 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> Message-ID: <1410857446.27823.1.camel@echo> On Mon, 2014-09-15 at 12:52 -0500, Tim Armstrong wrote: > So I was trying to get it to run local jobs via a passive service. The > jobs just seem to be accumulating in the service's queue and not being run. > > Maybe I'm using the wrong job manager - it's being left as NULL, which is > converted to fork. I can see how that would happen. I will fix it. In the mean time, I believe that setting provider to "local" might convince it to route the jobs through the proper queue. Mihael From tim.g.armstrong at gmail.com Tue Sep 16 09:52:27 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 16 Sep 2014 09:52:27 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410857446.27823.1.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> Message-ID: Would the "local" setting be in the -shared config file? On Tue, Sep 16, 2014 at 3:50 AM, Mihael Hategan wrote: > On Mon, 2014-09-15 at 12:52 -0500, Tim Armstrong wrote: > > So I was trying to get it to run local jobs via a passive service. The > > jobs just seem to be accumulating in the service's queue and not being > run. > > > > Maybe I'm using the wrong job manager - it's being left as NULL, which is > > converted to fork. > > I can see how that would happen. I will fix it. In the mean time, I > believe that setting provider to "local" might convince it to route the > jobs through the proper queue. > > Mihael > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Tue Sep 16 13:32:06 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Sep 2014 11:32:06 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> Message-ID: <1410892326.29235.1.camel@echo> On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > Would the "local" setting be in the -shared config file? Shared and passive should be mutually exclusive, although I don't think the code enforces that. I'll make sure it would. But hang on. The whole thing makes no sense. So let me get back to you on it. Mihael From hategan at mcs.anl.gov Tue Sep 16 15:02:19 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Sep 2014 13:02:19 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410892326.29235.1.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> Message-ID: <1410897739.30937.1.camel@echo> I take it all back. It looks like submitting jobs to a passive service with empty settings should work. What I do not see in your log is any workers being actually started. How are you starting the workers? Mihael On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > Would the "local" setting be in the -shared config file? > > Shared and passive should be mutually exclusive, although I don't think > the code enforces that. I'll make sure it would. > > But hang on. The whole thing makes no sense. So let me get back to you > on it. > > Mihael > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Sep 16 15:09:49 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Sep 2014 13:09:49 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410897739.30937.1.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> <1410897739.30937.1.camel@echo> Message-ID: <1410898189.30937.6.camel@echo> Ok, I see this in coaster-start-service.log: Running /home/tim/ExM/swift-k.git/dist/swift-svn/bin/worker.pl http://127.0.0.1:60566 LOCAL Not given: LOGDIR Looks like that script hasn't been updated in a while. 
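The log excerpt above shows worker.pl being launched with just a service URL and a block id, and the "Not given: LOGDIR" line is worker.pl complaining about a missing log directory argument. When starting a worker by hand, the log directory normally goes on the command line as well. A minimal sketch, with the argument order inferred from the excerpt and the error message, so treat it as an assumption rather than the documented interface:

    mkdir -p /tmp/worker-logs
    perl worker.pl http://127.0.0.1:60566 LOCAL /tmp/worker-logs

With no usable log directory the worker exits right away, which would also explain jobs piling up in the service queue with no workers ever registering.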
Things probably worked for you before because it wasn't running jobs through workers. I'll try to see what's happening with that. However, and correct me if I'm wrong, I don't see much benefit in this particular case for managing workers through a shell script rather than letting the service do the work. So as far as C client testing goes you could just use -shared. You can, of course, start worker.pl manually until this gets sorted out. Mihael On Tue, 2014-09-16 at 13:02 -0700, Mihael Hategan wrote: > I take it all back. > > It looks like submitting jobs to a passive service with empty settings > should work. > > What I do not see in your log is any workers being actually started. How > are you starting the workers? > > Mihael > > On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > > Would the "local" setting be in the -shared config file? > > > > Shared and passive should be mutually exclusive, although I don't think > > the code enforces that. I'll make sure it would. > > > > But hang on. The whole thing makes no sense. So let me get back to you > > on it. > > > > Mihael > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From tim.g.armstrong at gmail.com Tue Sep 16 16:17:26 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 16 Sep 2014 16:17:26 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410898189.30937.6.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> <1410897739.30937.1.camel@echo> <1410898189.30937.6.camel@echo> Message-ID: Right, the other mode of running the service probably makes the most sense, I have this in the test suite though since it's the most straightforward way to test the C client running in a passive coaster configuration. On Tue, Sep 16, 2014 at 3:09 PM, Mihael Hategan wrote: > Ok, I see this in coaster-start-service.log: > > Running /home/tim/ExM/swift-k.git/dist/swift-svn/bin/worker.pl > http://127.0.0.1:60566 LOCAL > Not given: LOGDIR > > Looks like that script hasn't been updated in a while. Things probably > worked for you before because it wasn't running jobs through workers. > > I'll try to see what's happening with that. > > However, and correct me if I'm wrong, I don't see much benefit in this > particular case for managing workers through a shell script rather than > letting the service do the work. So as far as C client testing goes you > could just use -shared. > > You can, of course, start worker.pl manually until this gets sorted out. > > Mihael > > On Tue, 2014-09-16 at 13:02 -0700, Mihael Hategan wrote: > > I take it all back. > > > > It looks like submitting jobs to a passive service with empty settings > > should work. > > > > What I do not see in your log is any workers being actually started. 
How > > are you starting the workers? > > > > Mihael > > > > On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > > > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > > > Would the "local" setting be in the -shared config file? > > > > > > Shared and passive should be mutually exclusive, although I don't think > > > the code enforces that. I'll make sure it would. > > > > > > But hang on. The whole thing makes no sense. So let me get back to you > > > on it. > > > > > > Mihael > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Sep 16 16:34:54 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Sep 2014 14:34:54 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> <1410897739.30937.1.camel@echo> <1410898189.30937.6.camel@echo> Message-ID: <1410903294.32438.0.camel@echo> Do you happen to have an empty/missing WORKER_LOG_DIR in the service config file? Mihael On Tue, 2014-09-16 at 16:17 -0500, Tim Armstrong wrote: > Right, the other mode of running the service probably makes the most sense, > I have this in the test suite though since it's the most straightforward > way to test the C client running in a passive coaster configuration. > > > > On Tue, Sep 16, 2014 at 3:09 PM, Mihael Hategan wrote: > > > Ok, I see this in coaster-start-service.log: > > > > Running /home/tim/ExM/swift-k.git/dist/swift-svn/bin/worker.pl > > http://127.0.0.1:60566 LOCAL > > Not given: LOGDIR > > > > Looks like that script hasn't been updated in a while. Things probably > > worked for you before because it wasn't running jobs through workers. > > > > I'll try to see what's happening with that. > > > > However, and correct me if I'm wrong, I don't see much benefit in this > > particular case for managing workers through a shell script rather than > > letting the service do the work. So as far as C client testing goes you > > could just use -shared. > > > > You can, of course, start worker.pl manually until this gets sorted out. > > > > Mihael > > > > On Tue, 2014-09-16 at 13:02 -0700, Mihael Hategan wrote: > > > I take it all back. > > > > > > It looks like submitting jobs to a passive service with empty settings > > > should work. > > > > > > What I do not see in your log is any workers being actually started. How > > > are you starting the workers? > > > > > > Mihael > > > > > > On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > > > > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > > > > Would the "local" setting be in the -shared config file? 
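For reference, the WORKER_LOG_DIR discussed in this exchange is a key in the configuration file that start-coaster-service reads. A minimal sketch of the relevant lines, assuming the file is sourced as shell the way the stock coaster-service.conf examples are; only WORKER_LOG_DIR is named in the thread, the WORKER_LOGGING_LEVEL key is an assumption:

    # coaster-service.conf (illustrative fragment, not a complete config)
    export WORKER_LOGGING_LEVEL=DEBUG        # assumed key controlling worker.pl verbosity
    export WORKER_LOG_DIR=/tmp/worker-logs   # must be set and writable, or worker.pl refuses to start

As comes out later in the thread, start-coaster-service neither enforces this setting nor reports workers that fail to start because of it, so a missing value is easy to overlook.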
> > > > > > > > Shared and passive should be mutually exclusive, although I don't think > > > > the code enforces that. I'll make sure it would. > > > > > > > > But hang on. The whole thing makes no sense. So let me get back to you > > > > on it. > > > > > > > > Mihael > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > From tim.g.armstrong at gmail.com Tue Sep 16 16:43:13 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 16 Sep 2014 16:43:13 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410903294.32438.0.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> <1410897739.30937.1.camel@echo> <1410898189.30937.6.camel@echo> <1410903294.32438.0.camel@echo> Message-ID: Yes, it's not set anywhere - should I be setting it to something? On Tue, Sep 16, 2014 at 4:34 PM, Mihael Hategan wrote: > Do you happen to have an empty/missing WORKER_LOG_DIR in the service > config file? > > Mihael > > On Tue, 2014-09-16 at 16:17 -0500, Tim Armstrong wrote: > > Right, the other mode of running the service probably makes the most > sense, > > I have this in the test suite though since it's the most straightforward > > way to test the C client running in a passive coaster configuration. > > > > > > > > On Tue, Sep 16, 2014 at 3:09 PM, Mihael Hategan > wrote: > > > > > Ok, I see this in coaster-start-service.log: > > > > > > Running /home/tim/ExM/swift-k.git/dist/swift-svn/bin/worker.pl > > > http://127.0.0.1:60566 LOCAL > > > Not given: LOGDIR > > > > > > Looks like that script hasn't been updated in a while. Things probably > > > worked for you before because it wasn't running jobs through workers. > > > > > > I'll try to see what's happening with that. > > > > > > However, and correct me if I'm wrong, I don't see much benefit in this > > > particular case for managing workers through a shell script rather than > > > letting the service do the work. So as far as C client testing goes you > > > could just use -shared. > > > > > > You can, of course, start worker.pl manually until this gets sorted > out. > > > > > > Mihael > > > > > > On Tue, 2014-09-16 at 13:02 -0700, Mihael Hategan wrote: > > > > I take it all back. > > > > > > > > It looks like submitting jobs to a passive service with empty > settings > > > > should work. > > > > > > > > What I do not see in your log is any workers being actually started. > How > > > > are you starting the workers? > > > > > > > > Mihael > > > > > > > > On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > > > > > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > > > > > Would the "local" setting be in the -shared config file? 
> > > > > > > > > > Shared and passive should be mutually exclusive, although I don't > think > > > > > the code enforces that. I'll make sure it would. > > > > > > > > > > But hang on. The whole thing makes no sense. So let me get back to > you > > > > > on it. > > > > > > > > > > Mihael > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Sep 16 16:47:20 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Sep 2014 14:47:20 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> <1410897739.30937.1.camel@echo> <1410898189.30937.6.camel@echo> <1410903294.32438.0.camel@echo> Message-ID: <1410904040.32576.0.camel@echo> Yes. That's what seems to be causing the problem. worker.pl requires it, but start-coaster-service doesn't enforce it nor does it report when workers fail to start. Mihael On Tue, 2014-09-16 at 16:43 -0500, Tim Armstrong wrote: > Yes, it's not set anywhere - should I be setting it to something? > > On Tue, Sep 16, 2014 at 4:34 PM, Mihael Hategan wrote: > > > Do you happen to have an empty/missing WORKER_LOG_DIR in the service > > config file? > > > > Mihael > > > > On Tue, 2014-09-16 at 16:17 -0500, Tim Armstrong wrote: > > > Right, the other mode of running the service probably makes the most > > sense, > > > I have this in the test suite though since it's the most straightforward > > > way to test the C client running in a passive coaster configuration. > > > > > > > > > > > > On Tue, Sep 16, 2014 at 3:09 PM, Mihael Hategan > > wrote: > > > > > > > Ok, I see this in coaster-start-service.log: > > > > > > > > Running /home/tim/ExM/swift-k.git/dist/swift-svn/bin/worker.pl > > > > http://127.0.0.1:60566 LOCAL > > > > Not given: LOGDIR > > > > > > > > Looks like that script hasn't been updated in a while. Things probably > > > > worked for you before because it wasn't running jobs through workers. > > > > > > > > I'll try to see what's happening with that. > > > > > > > > However, and correct me if I'm wrong, I don't see much benefit in this > > > > particular case for managing workers through a shell script rather than > > > > letting the service do the work. So as far as C client testing goes you > > > > could just use -shared. > > > > > > > > You can, of course, start worker.pl manually until this gets sorted > > out. > > > > > > > > Mihael > > > > > > > > On Tue, 2014-09-16 at 13:02 -0700, Mihael Hategan wrote: > > > > > I take it all back. > > > > > > > > > > It looks like submitting jobs to a passive service with empty > > settings > > > > > should work. 
> > > > > > > > > > What I do not see in your log is any workers being actually started. > > How > > > > > are you starting the workers? > > > > > > > > > > Mihael > > > > > > > > > > On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > > > > > > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > > > > > > Would the "local" setting be in the -shared config file? > > > > > > > > > > > > Shared and passive should be mutually exclusive, although I don't > > think > > > > > > the code enforces that. I'll make sure it would. > > > > > > > > > > > > But hang on. The whole thing makes no sense. So let me get back to > > you > > > > > > on it. > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > From tim.g.armstrong at gmail.com Tue Sep 16 17:13:00 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 16 Sep 2014 17:13:00 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410904040.32576.0.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> <1410897739.30937.1.camel@echo> <1410898189.30937.6.camel@echo> <1410903294.32438.0.camel@echo> <1410904040.32576.0.camel@echo> Message-ID: Thanks, seems to be working now. - Tim On Tue, Sep 16, 2014 at 4:47 PM, Mihael Hategan wrote: > Yes. That's what seems to be causing the problem. worker.pl requires it, > but start-coaster-service doesn't enforce it nor does it report when > workers fail to start. > > Mihael > > On Tue, 2014-09-16 at 16:43 -0500, Tim Armstrong wrote: > > Yes, it's not set anywhere - should I be setting it to something? > > > > On Tue, Sep 16, 2014 at 4:34 PM, Mihael Hategan > wrote: > > > > > Do you happen to have an empty/missing WORKER_LOG_DIR in the service > > > config file? > > > > > > Mihael > > > > > > On Tue, 2014-09-16 at 16:17 -0500, Tim Armstrong wrote: > > > > Right, the other mode of running the service probably makes the most > > > sense, > > > > I have this in the test suite though since it's the most > straightforward > > > > way to test the C client running in a passive coaster configuration. > > > > > > > > > > > > > > > > On Tue, Sep 16, 2014 at 3:09 PM, Mihael Hategan > > > > wrote: > > > > > > > > > Ok, I see this in coaster-start-service.log: > > > > > > > > > > Running /home/tim/ExM/swift-k.git/dist/swift-svn/bin/worker.pl > > > > > http://127.0.0.1:60566 LOCAL > > > > > Not given: LOGDIR > > > > > > > > > > Looks like that script hasn't been updated in a while. Things > probably > > > > > worked for you before because it wasn't running jobs through > workers. > > > > > > > > > > I'll try to see what's happening with that. 
> > > > > > > > > > However, and correct me if I'm wrong, I don't see much benefit in > this > > > > > particular case for managing workers through a shell script rather > than > > > > > letting the service do the work. So as far as C client testing > goes you > > > > > could just use -shared. > > > > > > > > > > You can, of course, start worker.pl manually until this gets > sorted > > > out. > > > > > > > > > > Mihael > > > > > > > > > > On Tue, 2014-09-16 at 13:02 -0700, Mihael Hategan wrote: > > > > > > I take it all back. > > > > > > > > > > > > It looks like submitting jobs to a passive service with empty > > > settings > > > > > > should work. > > > > > > > > > > > > What I do not see in your log is any workers being actually > started. > > > How > > > > > > are you starting the workers? > > > > > > > > > > > > Mihael > > > > > > > > > > > > On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > > > > > > > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > > > > > > > Would the "local" setting be in the -shared config file? > > > > > > > > > > > > > > Shared and passive should be mutually exclusive, although I > don't > > > think > > > > > > > the code enforces that. I'll make sure it would. > > > > > > > > > > > > > > But hang on. The whole thing makes no sense. So let me get > back to > > > you > > > > > > > on it. > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > Swift-devel mailing list > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ketan at mcs.anl.gov Fri Sep 19 13:15:36 2014 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Fri, 19 Sep 2014 13:15:36 -0500 Subject: [Swift-devel] worker logs Message-ID: Hi, A question about worker logs with Swift 0.95 automatic coasters: With these lines in sites.xml: DEBUG /tmp/workerlog I get worker logs in the said directory for local:local on my localhost. However, when trying the same for local:cobalt on BlueGene, I do not get any worker logs. Any clues as to what could be the reason for not getting worker logs on BlueGene? Compute nodes can write to workerLoggingDirectory so I do not thing it is an issue. Thanks, Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at anl.gov Fri Sep 19 13:18:32 2014 From: wilde at anl.gov (Michael Wilde) Date: Fri, 19 Sep 2014 13:18:32 -0500 Subject: [Swift-devel] worker logs In-Reply-To: References: Message-ID: <541C7378.50604@anl.gov> Perhaps an environment variable (or argument?) to turn on worker logging is not getting through the cobalt provider and/or cobalt to the worker? On 9/19/14, 1:15 PM, Ketan Maheshwari wrote: > Hi, > > A question about worker logs with Swift 0.95 automatic coasters: > > With these lines in sites.xml: > > DEBUG > key="workerLoggingDirectory">/tmp/workerlog > > I get worker logs in the said directory for local:local on my localhost. > > However, when trying the same for local:cobalt on BlueGene, I do not > get any worker logs. 
> > Any clues as to what could be the reason for not getting worker logs > on BlueGene? Compute nodes can write to workerLoggingDirectory so I do > not thing it is an issue. > > Thanks, > Ketan > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago -------------- next part -------------- An HTML attachment was scrubbed... URL: From wozniak at mcs.anl.gov Fri Sep 19 13:46:54 2014 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Fri, 19 Sep 2014 13:46:54 -0500 Subject: [Swift-devel] worker logs In-Reply-To: <541C7378.50604@anl.gov> References: <541C7378.50604@anl.gov> Message-ID: <541C7A1E.3070008@mcs.anl.gov> Are you running workers on the BG/Q compute nodes? Are you sure the workers are starting? You may want to put some output calls in the start of worker.pl to see if they start. On 09/19/2014 01:18 PM, Michael Wilde wrote: > Perhaps an environment variable (or argument?) to turn on worker > logging is not getting through the cobalt provider and/or cobalt to > the worker? > > > On 9/19/14, 1:15 PM, Ketan Maheshwari wrote: >> Hi, >> >> A question about worker logs with Swift 0.95 automatic coasters: >> >> With these lines in sites.xml: >> >> DEBUG >> > key="workerLoggingDirectory">/tmp/workerlog >> >> I get worker logs in the said directory for local:local on my localhost. >> >> However, when trying the same for local:cobalt on BlueGene, I do not >> get any worker logs. >> >> Any clues as to what could be the reason for not getting worker logs >> on BlueGene? Compute nodes can write to workerLoggingDirectory so I >> do not thing it is an issue. >> >> Thanks, >> Ketan >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Justin M Wozniak -------------- next part -------------- An HTML attachment was scrubbed... URL: From ketan at mcs.anl.gov Fri Sep 19 13:54:06 2014 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Fri, 19 Sep 2014 13:54:06 -0500 Subject: [Swift-devel] worker logs In-Reply-To: <541C7A1E.3070008@mcs.anl.gov> References: <541C7378.50604@anl.gov> <541C7A1E.3070008@mcs.anl.gov> Message-ID: Yes, I know the workers start because in an earlier modification in worker.pl I had a syntax error and the error message showed up on stderr. After, correcting the error, things actually run with provider coaster and mode local:cobalt so I assume the workers are running alright. With timestamped logs, I want to see if the jobs are spawned at right intervals as I intend with these mods. On Fri, Sep 19, 2014 at 1:46 PM, Justin M Wozniak wrote: > > Are you running workers on the BG/Q compute nodes? > > Are you sure the workers are starting? You may want to put some output > calls in the start of worker.pl to see if they start. > > On 09/19/2014 01:18 PM, Michael Wilde wrote: > > Perhaps an environment variable (or argument?) 
to turn on worker logging > is not getting through the cobalt provider and/or cobalt to the worker? > > > On 9/19/14, 1:15 PM, Ketan Maheshwari wrote: > > Hi, > > A question about worker logs with Swift 0.95 automatic coasters: > > With these lines in sites.xml: > > DEBUG > key="workerLoggingDirectory">/tmp/workerlog > > I get worker logs in the said directory for local:local on my localhost. > > However, when trying the same for local:cobalt on BlueGene, I do not get > any worker logs. > > Any clues as to what could be the reason for not getting worker logs on > BlueGene? Compute nodes can write to workerLoggingDirectory so I do not > thing it is an issue. > > Thanks, > Ketan > > > _______________________________________________ > Swift-devel mailing listSwift-devel at ci.uchicago.eduhttps://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > > > > _______________________________________________ > Swift-devel mailing listSwift-devel at ci.uchicago.eduhttps://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > Justin M Wozniak > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at anl.gov Fri Sep 19 13:57:58 2014 From: wilde at anl.gov (Michael Wilde) Date: Fri, 19 Sep 2014 13:57:58 -0500 Subject: [Swift-devel] worker logs In-Reply-To: References: <541C7378.50604@anl.gov> <541C7A1E.3070008@mcs.anl.gov> Message-ID: <541C7CB6.4070901@anl.gov> You can get close-enough data from the main Swift .log file. - Mike On 9/19/14, 1:54 PM, Ketan Maheshwari wrote: > Yes, I know the workers start because in an earlier modification in > worker.pl I had a syntax error and the error > message showed up on stderr. > > After, correcting the error, things actually run with provider coaster > and mode local:cobalt so I assume the workers are running alright. > With timestamped logs, I want to see if the jobs are spawned at right > intervals as I intend with these mods. > > On Fri, Sep 19, 2014 at 1:46 PM, Justin M Wozniak > wrote: > > > Are you running workers on the BG/Q compute nodes? > > Are you sure the workers are starting? You may want to put some > output calls in the start of worker.pl to see > if they start. > > On 09/19/2014 01:18 PM, Michael Wilde wrote: >> Perhaps an environment variable (or argument?) to turn on worker >> logging is not getting through the cobalt provider and/or cobalt >> to the worker? >> >> >> On 9/19/14, 1:15 PM, Ketan Maheshwari wrote: >>> Hi, >>> >>> A question about worker logs with Swift 0.95 automatic coasters: >>> >>> With these lines in sites.xml: >>> >>> DEBUG >>> >> key="workerLoggingDirectory">/tmp/workerlog >>> >>> I get worker logs in the said directory for local:local on my >>> localhost. >>> >>> However, when trying the same for local:cobalt on BlueGene, I do >>> not get any worker logs. >>> >>> Any clues as to what could be the reason for not getting worker >>> logs on BlueGene? Compute nodes can write to >>> workerLoggingDirectory so I do not thing it is an issue. 
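The sites.xml lines quoted in this thread lost their XML tags to the list's HTML scrubbing. Reconstructed, they were presumably coaster profile entries along these lines; the workerLoggingLevel key name for the DEBUG value is an inference from the surviving workerLoggingDirectory fragment, not something visible in the mail:

    <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile>
    <profile namespace="globus" key="workerLoggingDirectory">/tmp/workerlog</profile>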
>>> >>> Thanks, >>> Ketan >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> -- >> Michael Wilde >> Mathematics and Computer Science Computation Institute >> Argonne National Laboratory The University of Chicago >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > -- > Justin M Wozniak > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago -------------- next part -------------- An HTML attachment was scrubbed... URL: From davidkelly at uchicago.edu Mon Sep 22 13:50:10 2014 From: davidkelly at uchicago.edu (David Kelly) Date: Mon, 22 Sep 2014 13:50:10 -0500 Subject: [Swift-devel] Handling failures with job directory creation Message-ID: When running psims on Midway, we set our scratch directory set to /scratch/local (a local disk mounted on each node). Occasionally /scratch/local gets full or becomes unmounted. When this happens, jobs are quickly and repeatedly sent to this bad node and get marked as failed. Here are some ideas about how Swift could handle this better: The Swift/swiftwrap error messages don't identify which node the directory creation failed on, which makes it difficult to report these errors to cluster admins. If swiftwrap fails to create a job directory, the node could get marked as 'bad' and prevent jobs from running there. An alternative would be to have a rule says, if using more than one node, never re-run a failed task on the same node. It could still be possible for a task to hit multiple bad nodes, but much less likely. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Sep 23 18:09:00 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 23 Sep 2014 16:09:00 -0700 Subject: [Swift-devel] Handling failures with job directory creation In-Reply-To: References: Message-ID: <1411513740.5958.0.camel@echo> Is this with coasters? Mihael On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote: > When running psims on Midway, we set our scratch directory set to > /scratch/local (a local disk mounted on each node). Occasionally > /scratch/local gets full or becomes unmounted. When this happens, jobs are > quickly and repeatedly sent to this bad node and get marked as failed. > > Here are some ideas about how Swift could handle this better: > > The Swift/swiftwrap error messages don't identify which node the directory > creation failed on, which makes it difficult to report these errors to > cluster admins. > > If swiftwrap fails to create a job directory, the node could get marked as > 'bad' and prevent jobs from running there. > > An alternative would be to have a rule says, if using more than one node, > never re-run a failed task on the same node. It could still be possible for > a task to hit multiple bad nodes, but much less likely. 
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidkelly at uchicago.edu Tue Sep 23 19:52:15 2014 From: davidkelly at uchicago.edu (David Kelly) Date: Tue, 23 Sep 2014 19:52:15 -0500 Subject: [Swift-devel] Handling failures with job directory creation In-Reply-To: <1411513740.5958.0.camel@echo> References: <1411513740.5958.0.camel@echo> Message-ID: Yep, it's with coasters local:slurm On Tue, Sep 23, 2014 at 6:09 PM, Mihael Hategan wrote: > Is this with coasters? > > Mihael > > On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote: > > When running psims on Midway, we set our scratch directory set to > > /scratch/local (a local disk mounted on each node). Occasionally > > /scratch/local gets full or becomes unmounted. When this happens, jobs > are > > quickly and repeatedly sent to this bad node and get marked as failed. > > > > Here are some ideas about how Swift could handle this better: > > > > The Swift/swiftwrap error messages don't identify which node the > directory > > creation failed on, which makes it difficult to report these errors to > > cluster admins. > > > > If swiftwrap fails to create a job directory, the node could get marked > as > > 'bad' and prevent jobs from running there. > > > > An alternative would be to have a rule says, if using more than one node, > > never re-run a failed task on the same node. It could still be possible > for > > a task to hit multiple bad nodes, but much less likely. > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Sep 23 20:37:33 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 23 Sep 2014 18:37:33 -0700 Subject: [Swift-devel] Handling failures with job directory creation In-Reply-To: References: <1411513740.5958.0.camel@echo> Message-ID: <1411522653.7472.7.camel@echo> Right. It's a known problem. There is currently a quality measure for nodes which I think depends on failure rate and workers with higher quality are picked first if available. But this does not prevent bad nodes from being used if no good nodes are available. We could do something similar to what the swift scheduler does, which is to blacklist bad nodes for a certain duration (an exponential back-off sort of thing). As for the _swiftwrap messages, please feel free to experiment with the info() sub. In trunk, the job-to-node mapping information should be in the log and the log tools do use it as far as I remember. Mihael On Tue, 2014-09-23 at 19:52 -0500, David Kelly wrote: > Yep, it's with coasters local:slurm > > On Tue, Sep 23, 2014 at 6:09 PM, Mihael Hategan wrote: > > > Is this with coasters? > > > > Mihael > > > > On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote: > > > When running psims on Midway, we set our scratch directory set to > > > /scratch/local (a local disk mounted on each node). Occasionally > > > /scratch/local gets full or becomes unmounted. When this happens, jobs > > are > > > quickly and repeatedly sent to this bad node and get marked as failed. 
> > > > > > Here are some ideas about how Swift could handle this better: > > > > > > The Swift/swiftwrap error messages don't identify which node the > > directory > > > creation failed on, which makes it difficult to report these errors to > > > cluster admins. > > > > > > If swiftwrap fails to create a job directory, the node could get marked > > as > > > 'bad' and prevent jobs from running there. > > > > > > An alternative would be to have a rule says, if using more than one node, > > > never re-run a failed task on the same node. It could still be possible > > for > > > a task to hit multiple bad nodes, but much less likely. > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > From davidkelly at uchicago.edu Tue Sep 23 20:44:36 2014 From: davidkelly at uchicago.edu (David Kelly) Date: Tue, 23 Sep 2014 20:44:36 -0500 Subject: [Swift-devel] Handling failures with job directory creation In-Reply-To: <1411522653.7472.7.camel@echo> References: <1411513740.5958.0.camel@echo> <1411522653.7472.7.camel@echo> Message-ID: Thanks, I'll file this as a ticket/future improvement item On Tue, Sep 23, 2014 at 8:37 PM, Mihael Hategan wrote: > Right. It's a known problem. > > There is currently a quality measure for nodes which I think depends on > failure rate and workers with higher quality are picked first if > available. But this does not prevent bad nodes from being used if no > good nodes are available. > > We could do something similar to what the swift scheduler does, which is > to blacklist bad nodes for a certain duration (an exponential back-off > sort of thing). > > As for the _swiftwrap messages, please feel free to experiment with the > info() sub. In trunk, the job-to-node mapping information should be in > the log and the log tools do use it as far as I remember. > > Mihael > > On Tue, 2014-09-23 at 19:52 -0500, David Kelly wrote: > > Yep, it's with coasters local:slurm > > > > On Tue, Sep 23, 2014 at 6:09 PM, Mihael Hategan > wrote: > > > > > Is this with coasters? > > > > > > Mihael > > > > > > On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote: > > > > When running psims on Midway, we set our scratch directory set to > > > > /scratch/local (a local disk mounted on each node). Occasionally > > > > /scratch/local gets full or becomes unmounted. When this happens, > jobs > > > are > > > > quickly and repeatedly sent to this bad node and get marked as > failed. > > > > > > > > Here are some ideas about how Swift could handle this better: > > > > > > > > The Swift/swiftwrap error messages don't identify which node the > > > directory > > > > creation failed on, which makes it difficult to report these errors > to > > > > cluster admins. > > > > > > > > If swiftwrap fails to create a job directory, the node could get > marked > > > as > > > > 'bad' and prevent jobs from running there. > > > > > > > > An alternative would be to have a rule says, if using more than one > node, > > > > never re-run a failed task on the same node. It could still be > possible > > > for > > > > a task to hit multiple bad nodes, but much less likely. > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
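Tying together the two points above (failure messages that do not say which node they came from, and experimenting with _swiftwrap's info() sub): the low-effort fix is to include the node's hostname wherever the wrapper reports a failed job-directory creation. A sketch only; the variable name and the exact reporting path are assumptions about _swiftwrap's internals, not its actual code:

    # inside _swiftwrap, at the point where the job directory is created (names assumed)
    if ! mkdir -p "$jobdir"; then
        echo "failed to create job directory $jobdir on host $(hostname)" >&2
        exit 254
    fi

The same $(hostname) could just as well be appended to the info() output; the point is simply that any per-node failure message should name the node so it can be reported to the cluster admins.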