From ketan at mcs.anl.gov Tue Sep 2 15:01:52 2014 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Tue, 2 Sep 2014 15:01:52 -0500 Subject: [Swift-devel] error building swift trunk on bluegene In-Reply-To: <1408831365.24566.1.camel@echo> References: <1408831365.24566.1.camel@echo> Message-ID: The error seems to be persisting on bluegene with the latest trunk. On Sat, Aug 23, 2014 at 5:02 PM, Mihael Hategan wrote: > That particular issue should now be fixed in trunk. I'm able to build > swift with the ibm libraries (although within eclipse, so please give > this a shot in a real environment). > > Mihael > > On Fri, 2014-08-22 at 13:44 -0500, Ketan Maheshwari wrote: > > PS. The current java is from IBM: > > > > $ java -version > > java version "1.6.0" > > Java(TM) SE Runtime Environment (build pxp6460sr15-20131017_01(SR15)) > > IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux ppc64-64 > > jvmxp6460sr15-20131016_170922 (JIT enabled, AOT enabled) > > J9VM - 20131016_170922 > > JIT - r9_20130920_46510ifx2 > > GC - GA24_Java6_SR15_20131016_1337_B170922) > > JCL - 20131015_01 > > > > > > On Fri, Aug 22, 2014 at 1:41 PM, Ketan Maheshwari > wrote: > > > > > Hi, > > > > > > I am getting this error trying to build Swift trunk on BG Vesta: > > > > > > compile: > > > [echo] [util]: COMPILE > > > [mkdir] Created dir: > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > [javac] Compiling 56 source files to > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > [javac] > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > cannot find symbol > > > [javac] symbol : class VMManagement > > > [javac] location: package sun.management > > > [javac] sun.management.VMManagement mgmt = > > > (sun.management.VMManagement) jvm.get(runtime); > > > [javac] ^ > > > [javac] > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > cannot find symbol > > > [javac] symbol : class VMManagement > > > [javac] location: package sun.management > > > [javac] sun.management.VMManagement mgmt = > > > (sun.management.VMManagement) jvm.get(runtime); > > > [javac] ^ > > > [javac] Note: Some input files use or override a deprecated API. > > > [javac] Note: Recompile with -Xlint:deprecation for details. > > > [javac] Note: Some input files use unchecked or unsafe operations. > > > [javac] Note: Recompile with -Xlint:unchecked for details. > > > [javac] 2 errors > > > > > > Doe this means I need sun java on Vesta? Does anyone know if Sun java > is > > > available for PPC64 architecture? > > > > > > Thanks, > > > Ketan > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Sep 2 17:11:27 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 2 Sep 2014 15:11:27 -0700 Subject: [Swift-devel] error building swift trunk on bluegene In-Reply-To: References: <1408831365.24566.1.camel@echo> Message-ID: <1409695887.16233.14.camel@echo> Unless we have different definitions of "latest trunk", I don't think that's possible. 
See https://github.com/swift-lang/swift-k/blob/master/cogkit/modules/util/src/org/globus/cog/util/concurrent/FileLock.java Line 91 is this: --------------- 91: } --------------- Mihael On Tue, 2014-09-02 at 15:01 -0500, Ketan Maheshwari wrote: > The error seems to be persisting on bluegene with the latest trunk. > > > On Sat, Aug 23, 2014 at 5:02 PM, Mihael Hategan wrote: > > > That particular issue should now be fixed in trunk. I'm able to build > > swift with the ibm libraries (although within eclipse, so please give > > this a shot in a real environment). > > > > Mihael > > > > On Fri, 2014-08-22 at 13:44 -0500, Ketan Maheshwari wrote: > > > PS. The current java is from IBM: > > > > > > $ java -version > > > java version "1.6.0" > > > Java(TM) SE Runtime Environment (build pxp6460sr15-20131017_01(SR15)) > > > IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux ppc64-64 > > > jvmxp6460sr15-20131016_170922 (JIT enabled, AOT enabled) > > > J9VM - 20131016_170922 > > > JIT - r9_20130920_46510ifx2 > > > GC - GA24_Java6_SR15_20131016_1337_B170922) > > > JCL - 20131015_01 > > > > > > > > > On Fri, Aug 22, 2014 at 1:41 PM, Ketan Maheshwari > > wrote: > > > > > > > Hi, > > > > > > > > I am getting this error trying to build Swift trunk on BG Vesta: > > > > > > > > compile: > > > > [echo] [util]: COMPILE > > > > [mkdir] Created dir: > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > > [javac] Compiling 56 source files to > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > > [javac] > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > > cannot find symbol > > > > [javac] symbol : class VMManagement > > > > [javac] location: package sun.management > > > > [javac] sun.management.VMManagement mgmt = > > > > (sun.management.VMManagement) jvm.get(runtime); > > > > [javac] ^ > > > > [javac] > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > > cannot find symbol > > > > [javac] symbol : class VMManagement > > > > [javac] location: package sun.management > > > > [javac] sun.management.VMManagement mgmt = > > > > (sun.management.VMManagement) jvm.get(runtime); > > > > [javac] ^ > > > > [javac] Note: Some input files use or override a deprecated API. > > > > [javac] Note: Recompile with -Xlint:deprecation for details. > > > > [javac] Note: Some input files use unchecked or unsafe operations. > > > > [javac] Note: Recompile with -Xlint:unchecked for details. > > > > [javac] 2 errors > > > > > > > > Doe this means I need sun java on Vesta? Does anyone know if Sun java > > is > > > > available for PPC64 architecture? > > > > > > > > Thanks, > > > > Ketan > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > From ketan at mcs.anl.gov Wed Sep 3 09:42:33 2014 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Wed, 3 Sep 2014 09:42:33 -0500 Subject: [Swift-devel] error building swift trunk on bluegene In-Reply-To: <1409695887.16233.14.camel@echo> References: <1408831365.24566.1.camel@echo> <1409695887.16233.14.camel@echo> Message-ID: I was using svn repo on bluegene. 
Trying with git repo, I am unable to build: $ pwd /home/ketan/swift-k $ ant redist Buildfile: build.xml cleanGenerated: BUILD FAILED /gpfs/vesta-home/ketan/swift-k/build.xml:402: Directory does not exist:/gpfs/vesta-home/ketan/swift-k/src/org/griphyn/vdl/model Total time: 0 seconds On Tue, Sep 2, 2014 at 5:11 PM, Mihael Hategan wrote: > Unless we have different definitions of "latest trunk", I don't think > that's possible. > > See > > https://github.com/swift-lang/swift-k/blob/master/cogkit/modules/util/src/org/globus/cog/util/concurrent/FileLock.java > > Line 91 is this: > --------------- > 91: } > --------------- > > Mihael > > On Tue, 2014-09-02 at 15:01 -0500, Ketan Maheshwari wrote: > > The error seems to be persisting on bluegene with the latest trunk. > > > > > > On Sat, Aug 23, 2014 at 5:02 PM, Mihael Hategan > wrote: > > > > > That particular issue should now be fixed in trunk. I'm able to build > > > swift with the ibm libraries (although within eclipse, so please give > > > this a shot in a real environment). > > > > > > Mihael > > > > > > On Fri, 2014-08-22 at 13:44 -0500, Ketan Maheshwari wrote: > > > > PS. The current java is from IBM: > > > > > > > > $ java -version > > > > java version "1.6.0" > > > > Java(TM) SE Runtime Environment (build pxp6460sr15-20131017_01(SR15)) > > > > IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux ppc64-64 > > > > jvmxp6460sr15-20131016_170922 (JIT enabled, AOT enabled) > > > > J9VM - 20131016_170922 > > > > JIT - r9_20130920_46510ifx2 > > > > GC - GA24_Java6_SR15_20131016_1337_B170922) > > > > JCL - 20131015_01 > > > > > > > > > > > > On Fri, Aug 22, 2014 at 1:41 PM, Ketan Maheshwari > > > > wrote: > > > > > > > > > Hi, > > > > > > > > > > I am getting this error trying to build Swift trunk on BG Vesta: > > > > > > > > > > compile: > > > > > [echo] [util]: COMPILE > > > > > [mkdir] Created dir: > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > > > [javac] Compiling 56 source files to > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > > > [javac] > > > > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > > > cannot find symbol > > > > > [javac] symbol : class VMManagement > > > > > [javac] location: package sun.management > > > > > [javac] sun.management.VMManagement mgmt = > > > > > (sun.management.VMManagement) jvm.get(runtime); > > > > > [javac] ^ > > > > > [javac] > > > > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > > > cannot find symbol > > > > > [javac] symbol : class VMManagement > > > > > [javac] location: package sun.management > > > > > [javac] sun.management.VMManagement mgmt = > > > > > (sun.management.VMManagement) jvm.get(runtime); > > > > > [javac] > ^ > > > > > [javac] Note: Some input files use or override a deprecated > API. > > > > > [javac] Note: Recompile with -Xlint:deprecation for details. > > > > > [javac] Note: Some input files use unchecked or unsafe > operations. > > > > > [javac] Note: Recompile with -Xlint:unchecked for details. > > > > > [javac] 2 errors > > > > > > > > > > Doe this means I need sun java on Vesta? Does anyone know if Sun > java > > > is > > > > > available for PPC64 architecture? 
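
The failing statement quoted above reflects into sun.management.VMManagement, a Sun/Oracle-internal class that IBM's J9 JDK does not ship, which is why the IBM javac cannot resolve the symbol; installing Sun Java for PPC64 is not the only way around it. Below is a minimal sketch of the portable alternative, assuming (as the `jvm.get(runtime)` reflection pattern suggests, though it is not confirmed here) that the code at FileLock.java:91 only needed the current JVM's process id:

---------------
import java.lang.management.ManagementFactory;

public final class JvmPid {
    /**
     * Returns the current JVM's process id using only standard
     * java.lang.management APIs, so it compiles on IBM J9 as well as
     * on Sun/Oracle HotSpot. RuntimeMXBean.getName() conventionally
     * returns "pid@hostname" on both VMs, although that format is not
     * guaranteed by the Java 6 specification, hence the defensive parse.
     */
    public static int getPid() {
        String name = ManagementFactory.getRuntimeMXBean().getName();
        int at = name.indexOf('@');
        if (at <= 0) {
            throw new IllegalStateException("Cannot parse pid from: " + name);
        }
        return Integer.parseInt(name.substring(0, at));
    }

    public static void main(String[] args) {
        System.out.println(getPid());
    }
}
---------------
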
> > > > > > > > > > Thanks, > > > > > Ketan > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Sep 3 13:45:25 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 3 Sep 2014 11:45:25 -0700 Subject: [Swift-devel] error building swift trunk on bluegene In-Reply-To: References: <1408831365.24566.1.camel@echo> <1409695887.16233.14.camel@echo> Message-ID: <1409769925.14892.0.camel@echo> "ant dist" You are not re-building an already compiled swift, you are building one from scratch. Mihael On Wed, 2014-09-03 at 09:42 -0500, Ketan Maheshwari wrote: > I was using svn repo on bluegene. Trying with git repo, I am unable to > build: > > $ pwd > /home/ketan/swift-k > > $ ant redist > Buildfile: build.xml > > cleanGenerated: > > BUILD FAILED > /gpfs/vesta-home/ketan/swift-k/build.xml:402: Directory does not > exist:/gpfs/vesta-home/ketan/swift-k/src/org/griphyn/vdl/model > > Total time: 0 seconds > > > On Tue, Sep 2, 2014 at 5:11 PM, Mihael Hategan wrote: > > > Unless we have different definitions of "latest trunk", I don't think > > that's possible. > > > > See > > > > https://github.com/swift-lang/swift-k/blob/master/cogkit/modules/util/src/org/globus/cog/util/concurrent/FileLock.java > > > > Line 91 is this: > > --------------- > > 91: } > > --------------- > > > > Mihael > > > > On Tue, 2014-09-02 at 15:01 -0500, Ketan Maheshwari wrote: > > > The error seems to be persisting on bluegene with the latest trunk. > > > > > > > > > On Sat, Aug 23, 2014 at 5:02 PM, Mihael Hategan > > wrote: > > > > > > > That particular issue should now be fixed in trunk. I'm able to build > > > > swift with the ibm libraries (although within eclipse, so please give > > > > this a shot in a real environment). > > > > > > > > Mihael > > > > > > > > On Fri, 2014-08-22 at 13:44 -0500, Ketan Maheshwari wrote: > > > > > PS. 
The current java is from IBM: > > > > > > > > > > $ java -version > > > > > java version "1.6.0" > > > > > Java(TM) SE Runtime Environment (build pxp6460sr15-20131017_01(SR15)) > > > > > IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 Linux ppc64-64 > > > > > jvmxp6460sr15-20131016_170922 (JIT enabled, AOT enabled) > > > > > J9VM - 20131016_170922 > > > > > JIT - r9_20130920_46510ifx2 > > > > > GC - GA24_Java6_SR15_20131016_1337_B170922) > > > > > JCL - 20131015_01 > > > > > > > > > > > > > > > On Fri, Aug 22, 2014 at 1:41 PM, Ketan Maheshwari > > > > > > wrote: > > > > > > > > > > > Hi, > > > > > > > > > > > > I am getting this error trying to build Swift trunk on BG Vesta: > > > > > > > > > > > > compile: > > > > > > [echo] [util]: COMPILE > > > > > > [mkdir] Created dir: > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > > > > [javac] Compiling 56 source files to > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/build > > > > > > [javac] > > > > > > > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > > > > cannot find symbol > > > > > > [javac] symbol : class VMManagement > > > > > > [javac] location: package sun.management > > > > > > [javac] sun.management.VMManagement mgmt = > > > > > > (sun.management.VMManagement) jvm.get(runtime); > > > > > > [javac] ^ > > > > > > [javac] > > > > > > > > > > > > /gpfs/vesta-home/ketan/swift-devel/cog/modules/util/src/org/globus/cog/util/concurrent/FileLock.java:91: > > > > > > cannot find symbol > > > > > > [javac] symbol : class VMManagement > > > > > > [javac] location: package sun.management > > > > > > [javac] sun.management.VMManagement mgmt = > > > > > > (sun.management.VMManagement) jvm.get(runtime); > > > > > > [javac] > > ^ > > > > > > [javac] Note: Some input files use or override a deprecated > > API. > > > > > > [javac] Note: Recompile with -Xlint:deprecation for details. > > > > > > [javac] Note: Some input files use unchecked or unsafe > > operations. > > > > > > [javac] Note: Recompile with -Xlint:unchecked for details. > > > > > > [javac] 2 errors > > > > > > > > > > > > Doe this means I need sun java on Vesta? Does anyone know if Sun > > java > > > > is > > > > > > available for PPC64 architecture? > > > > > > > > > > > > Thanks, > > > > > > Ketan > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > From tim.g.armstrong at gmail.com Wed Sep 3 16:49:03 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Wed, 3 Sep 2014 16:49:03 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling Message-ID: I'm running a test Swift/T script that submit tasks to Coasters through the C++ client and I'm seeing some odd behaviour where task submission/execution is stalling for ~2 minute periods. 
For example, I'm seeing submit log messages like "submitting urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of several seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing bursts with the following intervals in my logs. 16:07:04,603 to 16:07:10,391 16:09:07,377 to 16:09:13,076 16:11:10,005 to 16:11:16,770 16:13:13,291 to 16:13:19,296 16:15:16,000 to 16:15:21,602 >From what I can tell, the delay is on the coaster service side: the C client is just waiting for a response. The jobs are just being submitted through the local job manager, so I wouldn't expect any delays there. The tasks are also just "/bin/hostname", so should return immediately. I'm going to continue digging into this on my own, but the 2 minute delay seems like a big clue: does anyone have an idea what could cause stalls in task submission of 2 minute duration? Cheers, Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Wed Sep 3 18:20:46 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 3 Sep 2014 16:20:46 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: Message-ID: <1409786446.18898.0.camel@echo> Hi Tim, I've never seen this before with pure Java. Do you have logs from these runs? Mihael On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > I'm running a test Swift/T script that submit tasks to Coasters through the > C++ client and I'm seeing some odd behaviour where task > submission/execution is stalling for ~2 minute periods. For example, I'm > seeing submit log messages like "submitting > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of several > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing bursts > with the following intervals in my logs. > > 16:07:04,603 to 16:07:10,391 > 16:09:07,377 to 16:09:13,076 > 16:11:10,005 to 16:11:16,770 > 16:13:13,291 to 16:13:19,296 > 16:15:16,000 to 16:15:21,602 > > From what I can tell, the delay is on the coaster service side: the C > client is just waiting for a response. > > The jobs are just being submitted through the local job manager, so I > wouldn't expect any delays there. The tasks are also just "/bin/hostname", > so should return immediately. > > I'm going to continue digging into this on my own, but the 2 minute delay > seems like a big clue: does anyone have an idea what could cause stalls in > task submission of 2 minute duration? > > Cheers, > Tim From tim.g.armstrong at gmail.com Wed Sep 3 20:26:33 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Wed, 3 Sep 2014 20:26:33 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1409786446.18898.0.camel@echo> References: <1409786446.18898.0.camel@echo> Message-ID: Here are client and service logs, with part of service log edited down to be a reasonable size (I have the full thing if needed, but it was over a gigabyte). One relevant section is from 19:49:35 onwards. The client submits 4 jobs (its limit), but they don't complete until 19:51:32 or so (I can see that one task completed based on ncompleted=1 in the check_tasks log message). It looks like something has happened with broken pipes and workers being lost, but I'm not sure what the ultimate cause of that is likely to be. - Tim On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan wrote: > Hi Tim, > > I've never seen this before with pure Java. > > Do you have logs from these runs? 
> > Mihael > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > I'm running a test Swift/T script that submit tasks to Coasters through > the > > C++ client and I'm seeing some odd behaviour where task > > submission/execution is stalling for ~2 minute periods. For example, I'm > > seeing submit log messages like "submitting > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of several > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing > bursts > > with the following intervals in my logs. > > > > 16:07:04,603 to 16:07:10,391 > > 16:09:07,377 to 16:09:13,076 > > 16:11:10,005 to 16:11:16,770 > > 16:13:13,291 to 16:13:19,296 > > 16:15:16,000 to 16:15:21,602 > > > > From what I can tell, the delay is on the coaster service side: the C > > client is just waiting for a response. > > > > The jobs are just being submitted through the local job manager, so I > > wouldn't expect any delays there. The tasks are also just > "/bin/hostname", > > so should return immediately. > > > > I'm going to continue digging into this on my own, but the 2 minute delay > > seems like a big clue: does anyone have an idea what could cause stalls > in > > task submission of 2 minute duration? > > > > Cheers, > > Tim > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: coaster-service.out.gz Type: application/x-gzip Size: 36069 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: swift-t-client.out.gz Type: application/x-gzip Size: 1049192 bytes Desc: not available URL: From hategan at mcs.anl.gov Wed Sep 3 22:35:22 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Wed, 3 Sep 2014 20:35:22 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> Message-ID: <1409801722.21132.8.camel@echo> Ah, makes sense. 2 minutes is the channel timeout. Each live connection is guaranteed to have some communication for any 2 minute time window, partially due to periodic heartbeats (sent every 1 minute). If no packets flow for the duration of 2 minutes, the connection is assumed broken and all jobs that were submitted to the respective workers are considered failed. So there seems to be an issue with the connections to some of the workers, and it takes 2 minutes to detect them. Since the service seems to be alive (although a jstack on the service when thing seem to hang might help), this leaves two possibilities: 1 - some genuine network problem 2 - the worker died without properly closing TCP connections If (2), you could enable worker logging (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything shows up. Mihael On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > Here are client and service logs, with part of service log edited down to > be a reasonable size (I have the full thing if needed, but it was over a > gigabyte). > > One relevant section is from 19:49:35 onwards. The client submits 4 jobs > (its limit), but they don't complete until 19:51:32 or so (I can see that > one task completed based on ncompleted=1 in the check_tasks log message). > It looks like something has happened with broken pipes and workers being > lost, but I'm not sure what the ultimate cause of that is likely to be. 
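
The two-minute gaps reported at the start of this thread line up exactly with the channel-timeout rule described above: heartbeats are sent every minute, and a connection with no traffic for a full two-minute window is assumed broken, failing the jobs submitted through it. The following is a minimal, purely illustrative sketch of that liveness rule; the class and method names are invented here and are not taken from the Coaster source:

---------------
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative watchdog for the rule described above: any packet
 * (including the once-a-minute heartbeat) refreshes a channel's
 * timestamp, and a channel with no traffic for 2 minutes is treated
 * as dead so its jobs can be failed. Not the actual implementation.
 */
class ChannelWatchdog {
    private static final long TIMEOUT_MS = 2 * 60 * 1000;

    // channel id -> time (ms) when the last packet was seen
    private final Map<String, Long> lastSeen =
        new ConcurrentHashMap<String, Long>();

    /** Call whenever any packet, heartbeat or otherwise, arrives. */
    public void packetReceived(String channelId) {
        lastSeen.put(channelId, Long.valueOf(System.currentTimeMillis()));
    }

    /** Polled periodically; true means the channel should be dropped. */
    public boolean isDead(String channelId) {
        Long last = lastSeen.get(channelId);
        return last != null
            && System.currentTimeMillis() - last.longValue() > TIMEOUT_MS;
    }
}
---------------
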
> > - Tim > > > > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan wrote: > > > Hi Tim, > > > > I've never seen this before with pure Java. > > > > Do you have logs from these runs? > > > > Mihael > > > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > > I'm running a test Swift/T script that submit tasks to Coasters through > > the > > > C++ client and I'm seeing some odd behaviour where task > > > submission/execution is stalling for ~2 minute periods. For example, I'm > > > seeing submit log messages like "submitting > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of several > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing > > bursts > > > with the following intervals in my logs. > > > > > > 16:07:04,603 to 16:07:10,391 > > > 16:09:07,377 to 16:09:13,076 > > > 16:11:10,005 to 16:11:16,770 > > > 16:13:13,291 to 16:13:19,296 > > > 16:15:16,000 to 16:15:21,602 > > > > > > From what I can tell, the delay is on the coaster service side: the C > > > client is just waiting for a response. > > > > > > The jobs are just being submitted through the local job manager, so I > > > wouldn't expect any delays there. The tasks are also just > > "/bin/hostname", > > > so should return immediately. > > > > > > I'm going to continue digging into this on my own, but the 2 minute delay > > > seems like a big clue: does anyone have an idea what could cause stalls > > in > > > task submission of 2 minute duration? > > > > > > Cheers, > > > Tim > > > > > > From tim.g.armstrong at gmail.com Thu Sep 4 13:11:04 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 4 Sep 2014 13:11:04 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1409801722.21132.8.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> Message-ID: This is all running locally on my laptop, so I think we can rule out 1). It also seems like it's a state the coaster service gets into after a few client sessions: generally the first coaster run works fine, then after a few runs the problem occurs more frequently. I'm going to try and get worker logs, in the meantime i've got some jstacks (attached). Matching service logs (largish) are here if needed: http://people.cs.uchicago.edu/~tga/service.out.gz On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan wrote: > Ah, makes sense. > > 2 minutes is the channel timeout. Each live connection is guaranteed to > have some communication for any 2 minute time window, partially due to > periodic heartbeats (sent every 1 minute). If no packets flow for the > duration of 2 minutes, the connection is assumed broken and all jobs > that were submitted to the respective workers are considered failed. So > there seems to be an issue with the connections to some of the workers, > and it takes 2 minutes to detect them. > > Since the service seems to be alive (although a jstack on the service > when thing seem to hang might help), this leaves two possibilities: > 1 - some genuine network problem > 2 - the worker died without properly closing TCP connections > > If (2), you could enable worker logging > (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything shows > up. > > Mihael > > On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > Here are client and service logs, with part of service log edited down to > > be a reasonable size (I have the full thing if needed, but it was over a > > gigabyte). > > > > One relevant section is from 19:49:35 onwards. 
The client submits 4 jobs > > (its limit), but they don't complete until 19:51:32 or so (I can see that > > one task completed based on ncompleted=1 in the check_tasks log message). > > It looks like something has happened with broken pipes and workers being > > lost, but I'm not sure what the ultimate cause of that is likely to be. > > > > - Tim > > > > > > > > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan > wrote: > > > > > Hi Tim, > > > > > > I've never seen this before with pure Java. > > > > > > Do you have logs from these runs? > > > > > > Mihael > > > > > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > > > I'm running a test Swift/T script that submit tasks to Coasters > through > > > the > > > > C++ client and I'm seeing some odd behaviour where task > > > > submission/execution is stalling for ~2 minute periods. For > example, I'm > > > > seeing submit log messages like "submitting > > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of > several > > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing > > > bursts > > > > with the following intervals in my logs. > > > > > > > > 16:07:04,603 to 16:07:10,391 > > > > 16:09:07,377 to 16:09:13,076 > > > > 16:11:10,005 to 16:11:16,770 > > > > 16:13:13,291 to 16:13:19,296 > > > > 16:15:16,000 to 16:15:21,602 > > > > > > > > From what I can tell, the delay is on the coaster service side: the C > > > > client is just waiting for a response. > > > > > > > > The jobs are just being submitted through the local job manager, so I > > > > wouldn't expect any delays there. The tasks are also just > > > "/bin/hostname", > > > > so should return immediately. > > > > > > > > I'm going to continue digging into this on my own, but the 2 minute > delay > > > > seems like a big clue: does anyone have an idea what could cause > stalls > > > in > > > > task submission of 2 minute duration? > > > > > > > > Cheers, > > > > Tim > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: hostnames-run1.out Type: application/octet-stream Size: 310493 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: hostnames-run2.out Type: application/octet-stream Size: 4461088 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: jstack.out Type: application/octet-stream Size: 113681 bytes Desc: not available URL: From tim.g.armstrong at gmail.com Thu Sep 4 14:35:29 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 4 Sep 2014 14:35:29 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> Message-ID: Ok, now I have some worker logs: http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz There's nothing obvious I see in the worker logs that would indicate why the connection was broken. - Tim On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong wrote: > This is all running locally on my laptop, so I think we can rule out 1). > > It also seems like it's a state the coaster service gets into after a few > client sessions: generally the first coaster run works fine, then after a > few runs the problem occurs more frequently. > > I'm going to try and get worker logs, in the meantime i've got some > jstacks (attached). 
> > Matching service logs (largish) are here if needed: > http://people.cs.uchicago.edu/~tga/service.out.gz > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan > wrote: > >> Ah, makes sense. >> >> 2 minutes is the channel timeout. Each live connection is guaranteed to >> have some communication for any 2 minute time window, partially due to >> periodic heartbeats (sent every 1 minute). If no packets flow for the >> duration of 2 minutes, the connection is assumed broken and all jobs >> that were submitted to the respective workers are considered failed. So >> there seems to be an issue with the connections to some of the workers, >> and it takes 2 minutes to detect them. >> >> Since the service seems to be alive (although a jstack on the service >> when thing seem to hang might help), this leaves two possibilities: >> 1 - some genuine network problem >> 2 - the worker died without properly closing TCP connections >> >> If (2), you could enable worker logging >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything shows >> up. >> >> Mihael >> >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: >> > Here are client and service logs, with part of service log edited down >> to >> > be a reasonable size (I have the full thing if needed, but it was over a >> > gigabyte). >> > >> > One relevant section is from 19:49:35 onwards. The client submits 4 >> jobs >> > (its limit), but they don't complete until 19:51:32 or so (I can see >> that >> > one task completed based on ncompleted=1 in the check_tasks log >> message). >> > It looks like something has happened with broken pipes and workers being >> > lost, but I'm not sure what the ultimate cause of that is likely to be. >> > >> > - Tim >> > >> > >> > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan >> wrote: >> > >> > > Hi Tim, >> > > >> > > I've never seen this before with pure Java. >> > > >> > > Do you have logs from these runs? >> > > >> > > Mihael >> > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: >> > > > I'm running a test Swift/T script that submit tasks to Coasters >> through >> > > the >> > > > C++ client and I'm seeing some odd behaviour where task >> > > > submission/execution is stalling for ~2 minute periods. For >> example, I'm >> > > > seeing submit log messages like "submitting >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of >> several >> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing >> > > bursts >> > > > with the following intervals in my logs. >> > > > >> > > > 16:07:04,603 to 16:07:10,391 >> > > > 16:09:07,377 to 16:09:13,076 >> > > > 16:11:10,005 to 16:11:16,770 >> > > > 16:13:13,291 to 16:13:19,296 >> > > > 16:15:16,000 to 16:15:21,602 >> > > > >> > > > From what I can tell, the delay is on the coaster service side: the >> C >> > > > client is just waiting for a response. >> > > > >> > > > The jobs are just being submitted through the local job manager, so >> I >> > > > wouldn't expect any delays there. The tasks are also just >> > > "/bin/hostname", >> > > > so should return immediately. >> > > > >> > > > I'm going to continue digging into this on my own, but the 2 minute >> delay >> > > > seems like a big clue: does anyone have an idea what could cause >> stalls >> > > in >> > > > task submission of 2 minute duration? >> > > > >> > > > Cheers, >> > > > Tim >> > > >> > > >> > > >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Thu Sep 4 15:03:06 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 4 Sep 2014 13:03:06 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> Message-ID: <1409860986.7960.3.camel@echo> The first worker "failing" is 0904-20022331. The log looks funny at the end. Can you git pull and re-run? The worker is getting some command at the end there and doing nothing about it and I wonder why. Mihael On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > Ok, now I have some worker logs: > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > There's nothing obvious I see in the worker logs that would indicate why > the connection was broken. > > - Tim > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong > wrote: > > > This is all running locally on my laptop, so I think we can rule out 1). > > > > It also seems like it's a state the coaster service gets into after a few > > client sessions: generally the first coaster run works fine, then after a > > few runs the problem occurs more frequently. > > > > I'm going to try and get worker logs, in the meantime i've got some > > jstacks (attached). > > > > Matching service logs (largish) are here if needed: > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan > > wrote: > > > >> Ah, makes sense. > >> > >> 2 minutes is the channel timeout. Each live connection is guaranteed to > >> have some communication for any 2 minute time window, partially due to > >> periodic heartbeats (sent every 1 minute). If no packets flow for the > >> duration of 2 minutes, the connection is assumed broken and all jobs > >> that were submitted to the respective workers are considered failed. So > >> there seems to be an issue with the connections to some of the workers, > >> and it takes 2 minutes to detect them. > >> > >> Since the service seems to be alive (although a jstack on the service > >> when thing seem to hang might help), this leaves two possibilities: > >> 1 - some genuine network problem > >> 2 - the worker died without properly closing TCP connections > >> > >> If (2), you could enable worker logging > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything shows > >> up. > >> > >> Mihael > >> > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > >> > Here are client and service logs, with part of service log edited down > >> to > >> > be a reasonable size (I have the full thing if needed, but it was over a > >> > gigabyte). > >> > > >> > One relevant section is from 19:49:35 onwards. The client submits 4 > >> jobs > >> > (its limit), but they don't complete until 19:51:32 or so (I can see > >> that > >> > one task completed based on ncompleted=1 in the check_tasks log > >> message). > >> > It looks like something has happened with broken pipes and workers being > >> > lost, but I'm not sure what the ultimate cause of that is likely to be. > >> > > >> > - Tim > >> > > >> > > >> > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan > >> wrote: > >> > > >> > > Hi Tim, > >> > > > >> > > I've never seen this before with pure Java. > >> > > > >> > > Do you have logs from these runs? 
> >> > > > >> > > Mihael > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > >> > > > I'm running a test Swift/T script that submit tasks to Coasters > >> through > >> > > the > >> > > > C++ client and I'm seeing some odd behaviour where task > >> > > > submission/execution is stalling for ~2 minute periods. For > >> example, I'm > >> > > > seeing submit log messages like "submitting > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of > >> several > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm seeing > >> > > bursts > >> > > > with the following intervals in my logs. > >> > > > > >> > > > 16:07:04,603 to 16:07:10,391 > >> > > > 16:09:07,377 to 16:09:13,076 > >> > > > 16:11:10,005 to 16:11:16,770 > >> > > > 16:13:13,291 to 16:13:19,296 > >> > > > 16:15:16,000 to 16:15:21,602 > >> > > > > >> > > > From what I can tell, the delay is on the coaster service side: the > >> C > >> > > > client is just waiting for a response. > >> > > > > >> > > > The jobs are just being submitted through the local job manager, so > >> I > >> > > > wouldn't expect any delays there. The tasks are also just > >> > > "/bin/hostname", > >> > > > so should return immediately. > >> > > > > >> > > > I'm going to continue digging into this on my own, but the 2 minute > >> delay > >> > > > seems like a big clue: does anyone have an idea what could cause > >> stalls > >> > > in > >> > > > task submission of 2 minute duration? > >> > > > > >> > > > Cheers, > >> > > > Tim > >> > > > >> > > > >> > > > >> > >> > >> > > From tim.g.armstrong at gmail.com Thu Sep 4 15:34:17 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 4 Sep 2014 15:34:17 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1409860986.7960.3.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> Message-ID: Should be here: http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan wrote: > The first worker "failing" is 0904-20022331. The log looks funny at the > end. > > Can you git pull and re-run? The worker is getting some command at the > end there and doing nothing about it and I wonder why. > > Mihael > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > Ok, now I have some worker logs: > > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > There's nothing obvious I see in the worker logs that would indicate why > > the connection was broken. > > > > - Tim > > > > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong > > > wrote: > > > > > This is all running locally on my laptop, so I think we can rule out > 1). > > > > > > It also seems like it's a state the coaster service gets into after a > few > > > client sessions: generally the first coaster run works fine, then > after a > > > few runs the problem occurs more frequently. > > > > > > I'm going to try and get worker logs, in the meantime i've got some > > > jstacks (attached). > > > > > > Matching service logs (largish) are here if needed: > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > > > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan > > > wrote: > > > > > >> Ah, makes sense. > > >> > > >> 2 minutes is the channel timeout. Each live connection is guaranteed > to > > >> have some communication for any 2 minute time window, partially due to > > >> periodic heartbeats (sent every 1 minute). 
If no packets flow for the > > >> duration of 2 minutes, the connection is assumed broken and all jobs > > >> that were submitted to the respective workers are considered failed. > So > > >> there seems to be an issue with the connections to some of the > workers, > > >> and it takes 2 minutes to detect them. > > >> > > >> Since the service seems to be alive (although a jstack on the service > > >> when thing seem to hang might help), this leaves two possibilities: > > >> 1 - some genuine network problem > > >> 2 - the worker died without properly closing TCP connections > > >> > > >> If (2), you could enable worker logging > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything > shows > > >> up. > > >> > > >> Mihael > > >> > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > >> > Here are client and service logs, with part of service log edited > down > > >> to > > >> > be a reasonable size (I have the full thing if needed, but it was > over a > > >> > gigabyte). > > >> > > > >> > One relevant section is from 19:49:35 onwards. The client submits 4 > > >> jobs > > >> > (its limit), but they don't complete until 19:51:32 or so (I can see > > >> that > > >> > one task completed based on ncompleted=1 in the check_tasks log > > >> message). > > >> > It looks like something has happened with broken pipes and workers > being > > >> > lost, but I'm not sure what the ultimate cause of that is likely to > be. > > >> > > > >> > - Tim > > >> > > > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan > > > >> wrote: > > >> > > > >> > > Hi Tim, > > >> > > > > >> > > I've never seen this before with pure Java. > > >> > > > > >> > > Do you have logs from these runs? > > >> > > > > >> > > Mihael > > >> > > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > >> > > > I'm running a test Swift/T script that submit tasks to Coasters > > >> through > > >> > > the > > >> > > > C++ client and I'm seeing some odd behaviour where task > > >> > > > submission/execution is stalling for ~2 minute periods. For > > >> example, I'm > > >> > > > seeing submit log messages like "submitting > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of > > >> several > > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm > seeing > > >> > > bursts > > >> > > > with the following intervals in my logs. > > >> > > > > > >> > > > 16:07:04,603 to 16:07:10,391 > > >> > > > 16:09:07,377 to 16:09:13,076 > > >> > > > 16:11:10,005 to 16:11:16,770 > > >> > > > 16:13:13,291 to 16:13:19,296 > > >> > > > 16:15:16,000 to 16:15:21,602 > > >> > > > > > >> > > > From what I can tell, the delay is on the coaster service side: > the > > >> C > > >> > > > client is just waiting for a response. > > >> > > > > > >> > > > The jobs are just being submitted through the local job > manager, so > > >> I > > >> > > > wouldn't expect any delays there. The tasks are also just > > >> > > "/bin/hostname", > > >> > > > so should return immediately. > > >> > > > > > >> > > > I'm going to continue digging into this on my own, but the 2 > minute > > >> delay > > >> > > > seems like a big clue: does anyone have an idea what could cause > > >> stalls > > >> > > in > > >> > > > task submission of 2 minute duration? > > >> > > > > > >> > > > Cheers, > > >> > > > Tim > > >> > > > > >> > > > > >> > > > > >> > > >> > > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Thu Sep 4 19:27:18 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 4 Sep 2014 17:27:18 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> Message-ID: <1409876838.3600.8.camel@echo> Ok, so that's legit. It does look like shut down workers are not properly accounted for in some places (and I believe Yadu submitted a bug for this). However, I do not see the dead time you mention in either of the last two sets of logs. It looks like each client instance submits a continous stream of jobs. So let's get back to the initial log. Can I have the full service log? I'm trying to track what happened with the jobs submitted before the first big pause. Also, a log message in CoasterClient::updateJobStatus() (or friends) would probably help a lot here. Mihael On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > Should be here: > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > > > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan wrote: > > > The first worker "failing" is 0904-20022331. The log looks funny at the > > end. > > > > Can you git pull and re-run? The worker is getting some command at the > > end there and doing nothing about it and I wonder why. > > > > Mihael > > > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > > Ok, now I have some worker logs: > > > > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > > > There's nothing obvious I see in the worker logs that would indicate why > > > the connection was broken. > > > > > > - Tim > > > > > > > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong > > > > > wrote: > > > > > > > This is all running locally on my laptop, so I think we can rule out > > 1). > > > > > > > > It also seems like it's a state the coaster service gets into after a > > few > > > > client sessions: generally the first coaster run works fine, then > > after a > > > > few runs the problem occurs more frequently. > > > > > > > > I'm going to try and get worker logs, in the meantime i've got some > > > > jstacks (attached). > > > > > > > > Matching service logs (largish) are here if needed: > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > > > > > > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan > > > > wrote: > > > > > > > >> Ah, makes sense. > > > >> > > > >> 2 minutes is the channel timeout. Each live connection is guaranteed > > to > > > >> have some communication for any 2 minute time window, partially due to > > > >> periodic heartbeats (sent every 1 minute). If no packets flow for the > > > >> duration of 2 minutes, the connection is assumed broken and all jobs > > > >> that were submitted to the respective workers are considered failed. > > So > > > >> there seems to be an issue with the connections to some of the > > workers, > > > >> and it takes 2 minutes to detect them. > > > >> > > > >> Since the service seems to be alive (although a jstack on the service > > > >> when thing seem to hang might help), this leaves two possibilities: > > > >> 1 - some genuine network problem > > > >> 2 - the worker died without properly closing TCP connections > > > >> > > > >> If (2), you could enable worker logging > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything > > shows > > > >> up. 
> > > >> > > > >> Mihael > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > > >> > Here are client and service logs, with part of service log edited > > down > > > >> to > > > >> > be a reasonable size (I have the full thing if needed, but it was > > over a > > > >> > gigabyte). > > > >> > > > > >> > One relevant section is from 19:49:35 onwards. The client submits 4 > > > >> jobs > > > >> > (its limit), but they don't complete until 19:51:32 or so (I can see > > > >> that > > > >> > one task completed based on ncompleted=1 in the check_tasks log > > > >> message). > > > >> > It looks like something has happened with broken pipes and workers > > being > > > >> > lost, but I'm not sure what the ultimate cause of that is likely to > > be. > > > >> > > > > >> > - Tim > > > >> > > > > >> > > > > >> > > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan > > > > > >> wrote: > > > >> > > > > >> > > Hi Tim, > > > >> > > > > > >> > > I've never seen this before with pure Java. > > > >> > > > > > >> > > Do you have logs from these runs? > > > >> > > > > > >> > > Mihael > > > >> > > > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > > >> > > > I'm running a test Swift/T script that submit tasks to Coasters > > > >> through > > > >> > > the > > > >> > > > C++ client and I'm seeing some odd behaviour where task > > > >> > > > submission/execution is stalling for ~2 minute periods. For > > > >> example, I'm > > > >> > > > seeing submit log messages like "submitting > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in bursts of > > > >> several > > > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm > > seeing > > > >> > > bursts > > > >> > > > with the following intervals in my logs. > > > >> > > > > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > >> > > > > > > >> > > > From what I can tell, the delay is on the coaster service side: > > the > > > >> C > > > >> > > > client is just waiting for a response. > > > >> > > > > > > >> > > > The jobs are just being submitted through the local job > > manager, so > > > >> I > > > >> > > > wouldn't expect any delays there. The tasks are also just > > > >> > > "/bin/hostname", > > > >> > > > so should return immediately. > > > >> > > > > > > >> > > > I'm going to continue digging into this on my own, but the 2 > > minute > > > >> delay > > > >> > > > seems like a big clue: does anyone have an idea what could cause > > > >> stalls > > > >> > > in > > > >> > > > task submission of 2 minute duration? > > > >> > > > > > > >> > > > Cheers, > > > >> > > > Tim > > > >> > > > > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > > > > > > > > From tim.g.armstrong at gmail.com Fri Sep 5 08:55:04 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Fri, 5 Sep 2014 08:55:04 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1409876838.3600.8.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> Message-ID: It's here: http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . I'll add some extra debug messages in the coaster C++ client and see if I can recreate the scenario. - Tim On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan wrote: > Ok, so that's legit. 
> > It does look like shut down workers are not properly accounted for in > some places (and I believe Yadu submitted a bug for this). However, I do > not see the dead time you mention in either of the last two sets of > logs. It looks like each client instance submits a continous stream of > jobs. > > So let's get back to the initial log. Can I have the full service log? > I'm trying to track what happened with the jobs submitted before the > first big pause. > > Also, a log message in CoasterClient::updateJobStatus() (or friends) > would probably help a lot here. > > Mihael > > On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > Should be here: > > > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > > > > > > > > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan > wrote: > > > > > The first worker "failing" is 0904-20022331. The log looks funny at the > > > end. > > > > > > Can you git pull and re-run? The worker is getting some command at the > > > end there and doing nothing about it and I wonder why. > > > > > > Mihael > > > > > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > > > Ok, now I have some worker logs: > > > > > > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > > > > > There's nothing obvious I see in the worker logs that would indicate > why > > > > the connection was broken. > > > > > > > > - Tim > > > > > > > > > > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > tim.g.armstrong at gmail.com > > > > > > > > wrote: > > > > > > > > > This is all running locally on my laptop, so I think we can rule > out > > > 1). > > > > > > > > > > It also seems like it's a state the coaster service gets into > after a > > > few > > > > > client sessions: generally the first coaster run works fine, then > > > after a > > > > > few runs the problem occurs more frequently. > > > > > > > > > > I'm going to try and get worker logs, in the meantime i've got some > > > > > jstacks (attached). > > > > > > > > > > Matching service logs (largish) are here if needed: > > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > > > > > > > > > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > hategan at mcs.anl.gov> > > > > > wrote: > > > > > > > > > >> Ah, makes sense. > > > > >> > > > > >> 2 minutes is the channel timeout. Each live connection is > guaranteed > > > to > > > > >> have some communication for any 2 minute time window, partially > due to > > > > >> periodic heartbeats (sent every 1 minute). If no packets flow for > the > > > > >> duration of 2 minutes, the connection is assumed broken and all > jobs > > > > >> that were submitted to the respective workers are considered > failed. > > > So > > > > >> there seems to be an issue with the connections to some of the > > > workers, > > > > >> and it takes 2 minutes to detect them. > > > > >> > > > > >> Since the service seems to be alive (although a jstack on the > service > > > > >> when thing seem to hang might help), this leaves two > possibilities: > > > > >> 1 - some genuine network problem > > > > >> 2 - the worker died without properly closing TCP connections > > > > >> > > > > >> If (2), you could enable worker logging > > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if anything > > > shows > > > > >> up. 
> > > > >> > > > > >> Mihael > > > > >> > > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > > > >> > Here are client and service logs, with part of service log > edited > > > down > > > > >> to > > > > >> > be a reasonable size (I have the full thing if needed, but it > was > > > over a > > > > >> > gigabyte). > > > > >> > > > > > >> > One relevant section is from 19:49:35 onwards. The client > submits 4 > > > > >> jobs > > > > >> > (its limit), but they don't complete until 19:51:32 or so (I > can see > > > > >> that > > > > >> > one task completed based on ncompleted=1 in the check_tasks log > > > > >> message). > > > > >> > It looks like something has happened with broken pipes and > workers > > > being > > > > >> > lost, but I'm not sure what the ultimate cause of that is > likely to > > > be. > > > > >> > > > > > >> > - Tim > > > > >> > > > > > >> > > > > > >> > > > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > hategan at mcs.anl.gov > > > > > > > > >> wrote: > > > > >> > > > > > >> > > Hi Tim, > > > > >> > > > > > > >> > > I've never seen this before with pure Java. > > > > >> > > > > > > >> > > Do you have logs from these runs? > > > > >> > > > > > > >> > > Mihael > > > > >> > > > > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > > > >> > > > I'm running a test Swift/T script that submit tasks to > Coasters > > > > >> through > > > > >> > > the > > > > >> > > > C++ client and I'm seeing some odd behaviour where task > > > > >> > > > submission/execution is stalling for ~2 minute periods. For > > > > >> example, I'm > > > > >> > > > seeing submit log messages like "submitting > > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in > bursts of > > > > >> several > > > > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. I'm > > > seeing > > > > >> > > bursts > > > > >> > > > with the following intervals in my logs. > > > > >> > > > > > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > > >> > > > > > > > >> > > > From what I can tell, the delay is on the coaster service > side: > > > the > > > > >> C > > > > >> > > > client is just waiting for a response. > > > > >> > > > > > > > >> > > > The jobs are just being submitted through the local job > > > manager, so > > > > >> I > > > > >> > > > wouldn't expect any delays there. The tasks are also just > > > > >> > > "/bin/hostname", > > > > >> > > > so should return immediately. > > > > >> > > > > > > > >> > > > I'm going to continue digging into this on my own, but the 2 > > > minute > > > > >> delay > > > > >> > > > seems like a big clue: does anyone have an idea what could > cause > > > > >> stalls > > > > >> > > in > > > > >> > > > task submission of 2 minute duration? > > > > >> > > > > > > > >> > > > Cheers, > > > > >> > > > Tim > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > >> > > > > >> > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tim.g.armstrong at gmail.com Fri Sep 5 12:13:02 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Fri, 5 Sep 2014 12:13:02 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> Message-ID: Ok, here it is with the additional debug messages. Source code change is in commit 890c41f2ba701b10264553471590096d6f94c278. Warning: the tarball will expand to several gigabytes of logs. I had to do multiple client runs to trigger it. It seems like the problem might be triggered by abnormal termination of the client. First 18 runs went fine, problem only started when I ctrl-c-ed the swift/t run #19 before the run #20 that exhibited delays. http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz - Tim On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong wrote: > It's here: > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . > > I'll add some extra debug messages in the coaster C++ client and see if I > can recreate the scenario. > > - Tim > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan > wrote: > >> Ok, so that's legit. >> >> It does look like shut down workers are not properly accounted for in >> some places (and I believe Yadu submitted a bug for this). However, I do >> not see the dead time you mention in either of the last two sets of >> logs. It looks like each client instance submits a continous stream of >> jobs. >> >> So let's get back to the initial log. Can I have the full service log? >> I'm trying to track what happened with the jobs submitted before the >> first big pause. >> >> Also, a log message in CoasterClient::updateJobStatus() (or friends) >> would probably help a lot here. >> >> Mihael >> >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: >> > Should be here: >> > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz >> > >> > >> > >> > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan >> wrote: >> > >> > > The first worker "failing" is 0904-20022331. The log looks funny at >> the >> > > end. >> > > >> > > Can you git pull and re-run? The worker is getting some command at the >> > > end there and doing nothing about it and I wonder why. >> > > >> > > Mihael >> > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: >> > > > Ok, now I have some worker logs: >> > > > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz >> > > > >> > > > There's nothing obvious I see in the worker logs that would >> indicate why >> > > > the connection was broken. >> > > > >> > > > - Tim >> > > > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < >> tim.g.armstrong at gmail.com >> > > > >> > > > wrote: >> > > > >> > > > > This is all running locally on my laptop, so I think we can rule >> out >> > > 1). >> > > > > >> > > > > It also seems like it's a state the coaster service gets into >> after a >> > > few >> > > > > client sessions: generally the first coaster run works fine, then >> > > after a >> > > > > few runs the problem occurs more frequently. >> > > > > >> > > > > I'm going to try and get worker logs, in the meantime i've got >> some >> > > > > jstacks (attached). 
>> > > > > >> > > > > Matching service logs (largish) are here if needed: >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz >> > > > > >> > > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < >> hategan at mcs.anl.gov> >> > > > > wrote: >> > > > > >> > > > >> Ah, makes sense. >> > > > >> >> > > > >> 2 minutes is the channel timeout. Each live connection is >> guaranteed >> > > to >> > > > >> have some communication for any 2 minute time window, partially >> due to >> > > > >> periodic heartbeats (sent every 1 minute). If no packets flow >> for the >> > > > >> duration of 2 minutes, the connection is assumed broken and all >> jobs >> > > > >> that were submitted to the respective workers are considered >> failed. >> > > So >> > > > >> there seems to be an issue with the connections to some of the >> > > workers, >> > > > >> and it takes 2 minutes to detect them. >> > > > >> >> > > > >> Since the service seems to be alive (although a jstack on the >> service >> > > > >> when thing seem to hang might help), this leaves two >> possibilities: >> > > > >> 1 - some genuine network problem >> > > > >> 2 - the worker died without properly closing TCP connections >> > > > >> >> > > > >> If (2), you could enable worker logging >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if >> anything >> > > shows >> > > > >> up. >> > > > >> >> > > > >> Mihael >> > > > >> >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: >> > > > >> > Here are client and service logs, with part of service log >> edited >> > > down >> > > > >> to >> > > > >> > be a reasonable size (I have the full thing if needed, but it >> was >> > > over a >> > > > >> > gigabyte). >> > > > >> > >> > > > >> > One relevant section is from 19:49:35 onwards. The client >> submits 4 >> > > > >> jobs >> > > > >> > (its limit), but they don't complete until 19:51:32 or so (I >> can see >> > > > >> that >> > > > >> > one task completed based on ncompleted=1 in the check_tasks log >> > > > >> message). >> > > > >> > It looks like something has happened with broken pipes and >> workers >> > > being >> > > > >> > lost, but I'm not sure what the ultimate cause of that is >> likely to >> > > be. >> > > > >> > >> > > > >> > - Tim >> > > > >> > >> > > > >> > >> > > > >> > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < >> hategan at mcs.anl.gov >> > > > >> > > > >> wrote: >> > > > >> > >> > > > >> > > Hi Tim, >> > > > >> > > >> > > > >> > > I've never seen this before with pure Java. >> > > > >> > > >> > > > >> > > Do you have logs from these runs? >> > > > >> > > >> > > > >> > > Mihael >> > > > >> > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: >> > > > >> > > > I'm running a test Swift/T script that submit tasks to >> Coasters >> > > > >> through >> > > > >> > > the >> > > > >> > > > C++ client and I'm seeing some odd behaviour where task >> > > > >> > > > submission/execution is stalling for ~2 minute periods. >> For >> > > > >> example, I'm >> > > > >> > > > seeing submit log messages like "submitting >> > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in >> bursts of >> > > > >> several >> > > > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. >> I'm >> > > seeing >> > > > >> > > bursts >> > > > >> > > > with the following intervals in my logs. 
>> > > > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 >> > > > >> > > > 16:09:07,377 to 16:09:13,076 >> > > > >> > > > 16:11:10,005 to 16:11:16,770 >> > > > >> > > > 16:13:13,291 to 16:13:19,296 >> > > > >> > > > 16:15:16,000 to 16:15:21,602 >> > > > >> > > > >> > > > >> > > > From what I can tell, the delay is on the coaster service >> side: >> > > the >> > > > >> C >> > > > >> > > > client is just waiting for a response. >> > > > >> > > > >> > > > >> > > > The jobs are just being submitted through the local job >> > > manager, so >> > > > >> I >> > > > >> > > > wouldn't expect any delays there. The tasks are also just >> > > > >> > > "/bin/hostname", >> > > > >> > > > so should return immediately. >> > > > >> > > > >> > > > >> > > > I'm going to continue digging into this on my own, but the >> 2 >> > > minute >> > > > >> delay >> > > > >> > > > seems like a big clue: does anyone have an idea what could >> cause >> > > > >> stalls >> > > > >> > > in >> > > > >> > > > task submission of 2 minute duration? >> > > > >> > > > >> > > > >> > > > Cheers, >> > > > >> > > > Tim >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> >> > > > >> >> > > > >> >> > > > > >> > > >> > > >> > > >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Fri Sep 5 12:57:18 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Fri, 5 Sep 2014 10:57:18 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> Message-ID: <1409939838.12288.3.camel@echo> Thanks. It also seems that there is an older bug in there in which the client connection is not properly accounted for and things start failing two minutes after the client connects (which is also probably why you didn't see this in runs with many short client connections). I'm not sure why the fix for that bug isn't in the trunk code. In any event, I'll set up a client submission loop and fix all these things. Mihael On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > Ok, here it is with the additional debug messages. Source code change is > in commit 890c41f2ba701b10264553471590096d6f94c278. > > Warning: the tarball will expand to several gigabytes of logs. > > I had to do multiple client runs to trigger it. It seems like the problem > might be triggered by abnormal termination of the client. First 18 runs > went fine, problem only started when I ctrl-c-ed the swift/t run #19 before > the run #20 that exhibited delays. > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > - Tim > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong > wrote: > > > It's here: > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . > > > > I'll add some extra debug messages in the coaster C++ client and see if I > > can recreate the scenario. > > > > - Tim > > > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan > > wrote: > > > >> Ok, so that's legit. > >> > >> It does look like shut down workers are not properly accounted for in > >> some places (and I believe Yadu submitted a bug for this). However, I do > >> not see the dead time you mention in either of the last two sets of > >> logs. It looks like each client instance submits a continous stream of > >> jobs. > >> > >> So let's get back to the initial log. Can I have the full service log? 
> >> I'm trying to track what happened with the jobs submitted before the > >> first big pause. > >> > >> Also, a log message in CoasterClient::updateJobStatus() (or friends) > >> would probably help a lot here. > >> > >> Mihael > >> > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > >> > Should be here: > >> > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > >> > > >> > > >> > > >> > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan > >> wrote: > >> > > >> > > The first worker "failing" is 0904-20022331. The log looks funny at > >> the > >> > > end. > >> > > > >> > > Can you git pull and re-run? The worker is getting some command at the > >> > > end there and doing nothing about it and I wonder why. > >> > > > >> > > Mihael > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > >> > > > Ok, now I have some worker logs: > >> > > > > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > >> > > > > >> > > > There's nothing obvious I see in the worker logs that would > >> indicate why > >> > > > the connection was broken. > >> > > > > >> > > > - Tim > >> > > > > >> > > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > >> tim.g.armstrong at gmail.com > >> > > > > >> > > > wrote: > >> > > > > >> > > > > This is all running locally on my laptop, so I think we can rule > >> out > >> > > 1). > >> > > > > > >> > > > > It also seems like it's a state the coaster service gets into > >> after a > >> > > few > >> > > > > client sessions: generally the first coaster run works fine, then > >> > > after a > >> > > > > few runs the problem occurs more frequently. > >> > > > > > >> > > > > I'm going to try and get worker logs, in the meantime i've got > >> some > >> > > > > jstacks (attached). > >> > > > > > >> > > > > Matching service logs (largish) are here if needed: > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > >> > > > > > >> > > > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > >> hategan at mcs.anl.gov> > >> > > > > wrote: > >> > > > > > >> > > > >> Ah, makes sense. > >> > > > >> > >> > > > >> 2 minutes is the channel timeout. Each live connection is > >> guaranteed > >> > > to > >> > > > >> have some communication for any 2 minute time window, partially > >> due to > >> > > > >> periodic heartbeats (sent every 1 minute). If no packets flow > >> for the > >> > > > >> duration of 2 minutes, the connection is assumed broken and all > >> jobs > >> > > > >> that were submitted to the respective workers are considered > >> failed. > >> > > So > >> > > > >> there seems to be an issue with the connections to some of the > >> > > workers, > >> > > > >> and it takes 2 minutes to detect them. > >> > > > >> > >> > > > >> Since the service seems to be alive (although a jstack on the > >> service > >> > > > >> when thing seem to hang might help), this leaves two > >> possibilities: > >> > > > >> 1 - some genuine network problem > >> > > > >> 2 - the worker died without properly closing TCP connections > >> > > > >> > >> > > > >> If (2), you could enable worker logging > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if > >> anything > >> > > shows > >> > > > >> up. 
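For reference, flipping the worker-logging switch mentioned above from the C++ client looks roughly like the fragment below. Only the key name comes from the message; the Settings::set() signature, the header name, and the idea that the settings object is handed to the client before submission are assumptions about the C client API, so check them against the actual headers in the coaster C client tree.

    // Rough sketch, not verified against the real client API: enable DEBUG
    // logging on the workers so that worker-side failures leave a trace.
    #include "Settings.h"      // assumed header name in the coaster C client

    void enableWorkerDebugLogging(Settings& settings) {
        // WORKER_LOGGING_LEVEL is the key quoted above; "DEBUG" is the level.
        settings.set(Settings::Key::WORKER_LOGGING_LEVEL, "DEBUG");
        // Pass `settings` to the client as usual before submitting jobs.
    }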
> >> > > > >> > >> > > > >> Mihael > >> > > > >> > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > >> > > > >> > Here are client and service logs, with part of service log > >> edited > >> > > down > >> > > > >> to > >> > > > >> > be a reasonable size (I have the full thing if needed, but it > >> was > >> > > over a > >> > > > >> > gigabyte). > >> > > > >> > > >> > > > >> > One relevant section is from 19:49:35 onwards. The client > >> submits 4 > >> > > > >> jobs > >> > > > >> > (its limit), but they don't complete until 19:51:32 or so (I > >> can see > >> > > > >> that > >> > > > >> > one task completed based on ncompleted=1 in the check_tasks log > >> > > > >> message). > >> > > > >> > It looks like something has happened with broken pipes and > >> workers > >> > > being > >> > > > >> > lost, but I'm not sure what the ultimate cause of that is > >> likely to > >> > > be. > >> > > > >> > > >> > > > >> > - Tim > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > >> hategan at mcs.anl.gov > >> > > > > >> > > > >> wrote: > >> > > > >> > > >> > > > >> > > Hi Tim, > >> > > > >> > > > >> > > > >> > > I've never seen this before with pure Java. > >> > > > >> > > > >> > > > >> > > Do you have logs from these runs? > >> > > > >> > > > >> > > > >> > > Mihael > >> > > > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > >> > > > >> > > > I'm running a test Swift/T script that submit tasks to > >> Coasters > >> > > > >> through > >> > > > >> > > the > >> > > > >> > > > C++ client and I'm seeing some odd behaviour where task > >> > > > >> > > > submission/execution is stalling for ~2 minute periods. > >> For > >> > > > >> example, I'm > >> > > > >> > > > seeing submit log messages like "submitting > >> > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in > >> bursts of > >> > > > >> several > >> > > > >> > > > seconds with a gap of roughly 2 minutes in between, e.g. > >> I'm > >> > > seeing > >> > > > >> > > bursts > >> > > > >> > > > with the following intervals in my logs. > >> > > > >> > > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > >> > > > >> > > > > >> > > > >> > > > From what I can tell, the delay is on the coaster service > >> side: > >> > > the > >> > > > >> C > >> > > > >> > > > client is just waiting for a response. > >> > > > >> > > > > >> > > > >> > > > The jobs are just being submitted through the local job > >> > > manager, so > >> > > > >> I > >> > > > >> > > > wouldn't expect any delays there. The tasks are also just > >> > > > >> > > "/bin/hostname", > >> > > > >> > > > so should return immediately. > >> > > > >> > > > > >> > > > >> > > > I'm going to continue digging into this on my own, but the > >> 2 > >> > > minute > >> > > > >> delay > >> > > > >> > > > seems like a big clue: does anyone have an idea what could > >> cause > >> > > > >> stalls > >> > > > >> > > in > >> > > > >> > > > task submission of 2 minute duration? 
> >> > > > >> > > > > >> > > > >> > > > Cheers, > >> > > > >> > > > Tim > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > >> > > > >> > >> > > > >> > >> > > > > > >> > > > >> > > > >> > > > >> > >> > >> > > From tim.g.armstrong at gmail.com Fri Sep 5 13:09:00 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Fri, 5 Sep 2014 13:09:00 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1409939838.12288.3.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> Message-ID: Thanks, let me know if there's anything I can help do. - Tim On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan wrote: > Thanks. It also seems that there is an older bug in there in which the > client connection is not properly accounted for and things start failing > two minutes after the client connects (which is also probably why you > didn't see this in runs with many short client connections). I'm not > sure why the fix for that bug isn't in the trunk code. > > In any event, I'll set up a client submission loop and fix all these > things. > > Mihael > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > Ok, here it is with the additional debug messages. Source code change is > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > > Warning: the tarball will expand to several gigabytes of logs. > > > > I had to do multiple client runs to trigger it. It seems like the > problem > > might be triggered by abnormal termination of the client. First 18 runs > > went fine, problem only started when I ctrl-c-ed the swift/t run #19 > before > > the run #20 that exhibited delays. > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > > - Tim > > > > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong > > > wrote: > > > > > It's here: > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . > > > > > > I'll add some extra debug messages in the coaster C++ client and see > if I > > > can recreate the scenario. > > > > > > - Tim > > > > > > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan > > > wrote: > > > > > >> Ok, so that's legit. > > >> > > >> It does look like shut down workers are not properly accounted for in > > >> some places (and I believe Yadu submitted a bug for this). However, I > do > > >> not see the dead time you mention in either of the last two sets of > > >> logs. It looks like each client instance submits a continous stream of > > >> jobs. > > >> > > >> So let's get back to the initial log. Can I have the full service log? > > >> I'm trying to track what happened with the jobs submitted before the > > >> first big pause. > > >> > > >> Also, a log message in CoasterClient::updateJobStatus() (or friends) > > >> would probably help a lot here. > > >> > > >> Mihael > > >> > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > >> > Should be here: > > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > >> > > > >> > > > >> > > > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan > > > >> wrote: > > >> > > > >> > > The first worker "failing" is 0904-20022331. The log looks funny > at > > >> the > > >> > > end. > > >> > > > > >> > > Can you git pull and re-run? The worker is getting some command > at the > > >> > > end there and doing nothing about it and I wonder why. 
> > >> > > > > >> > > Mihael > > >> > > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > >> > > > Ok, now I have some worker logs: > > >> > > > > > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > >> > > > > > >> > > > There's nothing obvious I see in the worker logs that would > > >> indicate why > > >> > > > the connection was broken. > > >> > > > > > >> > > > - Tim > > >> > > > > > >> > > > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > >> tim.g.armstrong at gmail.com > > >> > > > > > >> > > > wrote: > > >> > > > > > >> > > > > This is all running locally on my laptop, so I think we can > rule > > >> out > > >> > > 1). > > >> > > > > > > >> > > > > It also seems like it's a state the coaster service gets into > > >> after a > > >> > > few > > >> > > > > client sessions: generally the first coaster run works fine, > then > > >> > > after a > > >> > > > > few runs the problem occurs more frequently. > > >> > > > > > > >> > > > > I'm going to try and get worker logs, in the meantime i've got > > >> some > > >> > > > > jstacks (attached). > > >> > > > > > > >> > > > > Matching service logs (largish) are here if needed: > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > >> > > > > > > >> > > > > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > > >> hategan at mcs.anl.gov> > > >> > > > > wrote: > > >> > > > > > > >> > > > >> Ah, makes sense. > > >> > > > >> > > >> > > > >> 2 minutes is the channel timeout. Each live connection is > > >> guaranteed > > >> > > to > > >> > > > >> have some communication for any 2 minute time window, > partially > > >> due to > > >> > > > >> periodic heartbeats (sent every 1 minute). If no packets flow > > >> for the > > >> > > > >> duration of 2 minutes, the connection is assumed broken and > all > > >> jobs > > >> > > > >> that were submitted to the respective workers are considered > > >> failed. > > >> > > So > > >> > > > >> there seems to be an issue with the connections to some of > the > > >> > > workers, > > >> > > > >> and it takes 2 minutes to detect them. > > >> > > > >> > > >> > > > >> Since the service seems to be alive (although a jstack on the > > >> service > > >> > > > >> when thing seem to hang might help), this leaves two > > >> possibilities: > > >> > > > >> 1 - some genuine network problem > > >> > > > >> 2 - the worker died without properly closing TCP connections > > >> > > > >> > > >> > > > >> If (2), you could enable worker logging > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if > > >> anything > > >> > > shows > > >> > > > >> up. > > >> > > > >> > > >> > > > >> Mihael > > >> > > > >> > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > >> > > > >> > Here are client and service logs, with part of service log > > >> edited > > >> > > down > > >> > > > >> to > > >> > > > >> > be a reasonable size (I have the full thing if needed, but > it > > >> was > > >> > > over a > > >> > > > >> > gigabyte). > > >> > > > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. The client > > >> submits 4 > > >> > > > >> jobs > > >> > > > >> > (its limit), but they don't complete until 19:51:32 or so > (I > > >> can see > > >> > > > >> that > > >> > > > >> > one task completed based on ncompleted=1 in the > check_tasks log > > >> > > > >> message). 
> > >> > > > >> > It looks like something has happened with broken pipes and > > >> workers > > >> > > being > > >> > > > >> > lost, but I'm not sure what the ultimate cause of that is > > >> likely to > > >> > > be. > > >> > > > >> > > > >> > > > >> > - Tim > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > > >> hategan at mcs.anl.gov > > >> > > > > > >> > > > >> wrote: > > >> > > > >> > > > >> > > > >> > > Hi Tim, > > >> > > > >> > > > > >> > > > >> > > I've never seen this before with pure Java. > > >> > > > >> > > > > >> > > > >> > > Do you have logs from these runs? > > >> > > > >> > > > > >> > > > >> > > Mihael > > >> > > > >> > > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > >> > > > >> > > > I'm running a test Swift/T script that submit tasks to > > >> Coasters > > >> > > > >> through > > >> > > > >> > > the > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour where task > > >> > > > >> > > > submission/execution is stalling for ~2 minute periods. > > >> For > > >> > > > >> example, I'm > > >> > > > >> > > > seeing submit log messages like "submitting > > >> > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in > > >> bursts of > > >> > > > >> several > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in between, > e.g. > > >> I'm > > >> > > seeing > > >> > > > >> > > bursts > > >> > > > >> > > > with the following intervals in my logs. > > >> > > > >> > > > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > >> > > > >> > > > > > >> > > > >> > > > From what I can tell, the delay is on the coaster > service > > >> side: > > >> > > the > > >> > > > >> C > > >> > > > >> > > > client is just waiting for a response. > > >> > > > >> > > > > > >> > > > >> > > > The jobs are just being submitted through the local job > > >> > > manager, so > > >> > > > >> I > > >> > > > >> > > > wouldn't expect any delays there. The tasks are also > just > > >> > > > >> > > "/bin/hostname", > > >> > > > >> > > > so should return immediately. > > >> > > > >> > > > > > >> > > > >> > > > I'm going to continue digging into this on my own, but > the > > >> 2 > > >> > > minute > > >> > > > >> delay > > >> > > > >> > > > seems like a big clue: does anyone have an idea what > could > > >> cause > > >> > > > >> stalls > > >> > > > >> > > in > > >> > > > >> > > > task submission of 2 minute duration? > > >> > > > >> > > > > > >> > > > >> > > > Cheers, > > >> > > > >> > > > Tim > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > > > > >> > > > > >> > > > > >> > > > > >> > > >> > > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Sat Sep 6 17:02:06 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 6 Sep 2014 15:02:06 -0700 Subject: [Swift-devel] calls for papers and our mailing lists Message-ID: <1410040926.5304.1.camel@echo> Hi, Do we want to receive CFPs on swift-user or swift-devel? 
Mihael From hategan at mcs.anl.gov Mon Sep 8 14:38:21 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Mon, 8 Sep 2014 12:38:21 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> Message-ID: <1410205101.24345.22.camel@echo> So... There were bugs. Lots of bugs. I did some work over the weekend to fix some of these and clean up the coaster code. Here's a summary: - there was some stuff in the low level coaster code to deal with persisting coaster channels over multiple connections with various options, like periodic connections, client or server initiated connections, buffering of commands, etc. None of this was used by Swift, and the code was pretty messy. I removed that. - there were some issues with multiple clients: * improper shutdown of relevant workers when a client disconnected * the worker task dispatcher was a singleton and had a reference to one block allocator, whereas multiple clients involved multiple allocators. - there were a bunch of locking issues in the C client that valgrind caught - the idea of remote job ids was a bit hard to work with. This remote id was the job id that the service assigned to a job. This is necessary because two different clients can submit jobs with the same id. The remote id would be communicated to the client as the reply to the submit request. However, it was entirely possible for a notification about job status to be sent to the client before the submit reply was. Since notifications were sent using the remote-id, the client would have no idea what job the notifications belonged to. Now, the server might need a unique job id, but there is no reason why it cannot use the client id when communicating the status to a client. So that's there now. - the way the C client was working, its jobs ended up not going to the workers, but the local queue. The service settings now allow specifying the provider/jobManager/url to be used to start blocks, and jobs are routed appropriately if they do not have the batch job flag set. I also added a shared service mode. We discussed this before. Basically you start the coaster service with "-shared " and all the settings are read from that file. In this case, all clients share the same worker pool, and client settings are ignored. The C client now has a multi-job testing tool which can submit many jobs with the desired level of concurrency. I have tested the C client with both shared and non-shared mode, with various levels of jobs being sent, with either one or two concurrent clients. I haven't tested manual workers. I've also decided that during normal operation (i.e. client connects, submits jobs, shuts down gracefully), there should be no exceptions in the coaster log. I think we should stick to that principle. This was the case last I tested, and we should consider any deviation from that to be a problem. Of course, there are some things for which there is no graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are fine in that case. So anyway, let's start from here. Mihael On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > Thanks, let me know if there's anything I can help do. > > - Tim > > > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan wrote: > > > Thanks. 
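The remote-id race described in the summary above is easier to see with a small sketch. Here is a minimal illustration in C++ with made-up names (this is not the actual CoasterClient code): the client keeps its job table keyed by the id it generated at submit time, so a status notification that arrives before the submit reply can still be matched, which is exactly what keying notifications by the client id buys.

    #include <map>
    #include <stdexcept>
    #include <string>

    enum JobStatus { SUBMITTED, ACTIVE, COMPLETED, FAILED };

    // Illustrative client-side job table. Because the service now reports
    // status using the id the client assigned, no mapping from a server-side
    // "remote id" (unknown until the submit reply arrives) is needed to route
    // notifications.
    class JobTable {
        std::map<std::string, JobStatus> jobs;   // keyed by client-assigned id
    public:
        void submitted(const std::string& clientId) {
            jobs[clientId] = SUBMITTED;
        }
        void onStatus(const std::string& clientId, JobStatus s) {
            std::map<std::string, JobStatus>::iterator it = jobs.find(clientId);
            if (it == jobs.end()) {
                throw std::runtime_error("status for unknown job " + clientId);
            }
            it->second = s;                      // works even if the submit
        }                                        // reply has not arrived yet
    };

The service is still free to keep its own unique id internally; it just translates back to the client's id when talking to that client.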
It also seems that there is an older bug in there in which the > > client connection is not properly accounted for and things start failing > > two minutes after the client connects (which is also probably why you > > didn't see this in runs with many short client connections). I'm not > > sure why the fix for that bug isn't in the trunk code. > > > > In any event, I'll set up a client submission loop and fix all these > > things. > > > > Mihael > > > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > > Ok, here it is with the additional debug messages. Source code change is > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > > > > Warning: the tarball will expand to several gigabytes of logs. > > > > > > I had to do multiple client runs to trigger it. It seems like the > > problem > > > might be triggered by abnormal termination of the client. First 18 runs > > > went fine, problem only started when I ctrl-c-ed the swift/t run #19 > > before > > > the run #20 that exhibited delays. > > > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > > > > - Tim > > > > > > > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong > > > > > wrote: > > > > > > > It's here: > > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . > > > > > > > > I'll add some extra debug messages in the coaster C++ client and see > > if I > > > > can recreate the scenario. > > > > > > > > - Tim > > > > > > > > > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan > > > > wrote: > > > > > > > >> Ok, so that's legit. > > > >> > > > >> It does look like shut down workers are not properly accounted for in > > > >> some places (and I believe Yadu submitted a bug for this). However, I > > do > > > >> not see the dead time you mention in either of the last two sets of > > > >> logs. It looks like each client instance submits a continous stream of > > > >> jobs. > > > >> > > > >> So let's get back to the initial log. Can I have the full service log? > > > >> I'm trying to track what happened with the jobs submitted before the > > > >> first big pause. > > > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or friends) > > > >> would probably help a lot here. > > > >> > > > >> Mihael > > > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > > >> > Should be here: > > > >> > > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan > > > > > >> wrote: > > > >> > > > > >> > > The first worker "failing" is 0904-20022331. The log looks funny > > at > > > >> the > > > >> > > end. > > > >> > > > > > >> > > Can you git pull and re-run? The worker is getting some command > > at the > > > >> > > end there and doing nothing about it and I wonder why. > > > >> > > > > > >> > > Mihael > > > >> > > > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > > >> > > > Ok, now I have some worker logs: > > > >> > > > > > > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > >> > > > > > > >> > > > There's nothing obvious I see in the worker logs that would > > > >> indicate why > > > >> > > > the connection was broken. 
> > > >> > > > > > > >> > > > - Tim > > > >> > > > > > > >> > > > > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > > >> tim.g.armstrong at gmail.com > > > >> > > > > > > >> > > > wrote: > > > >> > > > > > > >> > > > > This is all running locally on my laptop, so I think we can > > rule > > > >> out > > > >> > > 1). > > > >> > > > > > > > >> > > > > It also seems like it's a state the coaster service gets into > > > >> after a > > > >> > > few > > > >> > > > > client sessions: generally the first coaster run works fine, > > then > > > >> > > after a > > > >> > > > > few runs the problem occurs more frequently. > > > >> > > > > > > > >> > > > > I'm going to try and get worker logs, in the meantime i've got > > > >> some > > > >> > > > > jstacks (attached). > > > >> > > > > > > > >> > > > > Matching service logs (largish) are here if needed: > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > >> > > > > > > > >> > > > > > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > > > >> hategan at mcs.anl.gov> > > > >> > > > > wrote: > > > >> > > > > > > > >> > > > >> Ah, makes sense. > > > >> > > > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live connection is > > > >> guaranteed > > > >> > > to > > > >> > > > >> have some communication for any 2 minute time window, > > partially > > > >> due to > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no packets flow > > > >> for the > > > >> > > > >> duration of 2 minutes, the connection is assumed broken and > > all > > > >> jobs > > > >> > > > >> that were submitted to the respective workers are considered > > > >> failed. > > > >> > > So > > > >> > > > >> there seems to be an issue with the connections to some of > > the > > > >> > > workers, > > > >> > > > >> and it takes 2 minutes to detect them. > > > >> > > > >> > > > >> > > > >> Since the service seems to be alive (although a jstack on the > > > >> service > > > >> > > > >> when thing seem to hang might help), this leaves two > > > >> possibilities: > > > >> > > > >> 1 - some genuine network problem > > > >> > > > >> 2 - the worker died without properly closing TCP connections > > > >> > > > >> > > > >> > > > >> If (2), you could enable worker logging > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if > > > >> anything > > > >> > > shows > > > >> > > > >> up. > > > >> > > > >> > > > >> > > > >> Mihael > > > >> > > > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > > >> > > > >> > Here are client and service logs, with part of service log > > > >> edited > > > >> > > down > > > >> > > > >> to > > > >> > > > >> > be a reasonable size (I have the full thing if needed, but > > it > > > >> was > > > >> > > over a > > > >> > > > >> > gigabyte). > > > >> > > > >> > > > > >> > > > >> > One relevant section is from 19:49:35 onwards. The client > > > >> submits 4 > > > >> > > > >> jobs > > > >> > > > >> > (its limit), but they don't complete until 19:51:32 or so > > (I > > > >> can see > > > >> > > > >> that > > > >> > > > >> > one task completed based on ncompleted=1 in the > > check_tasks log > > > >> > > > >> message). > > > >> > > > >> > It looks like something has happened with broken pipes and > > > >> workers > > > >> > > being > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of that is > > > >> likely to > > > >> > > be. 
> > > >> > > > >> > > > > >> > > > >> > - Tim > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > > > >> hategan at mcs.anl.gov > > > >> > > > > > > >> > > > >> wrote: > > > >> > > > >> > > > > >> > > > >> > > Hi Tim, > > > >> > > > >> > > > > > >> > > > >> > > I've never seen this before with pure Java. > > > >> > > > >> > > > > > >> > > > >> > > Do you have logs from these runs? > > > >> > > > >> > > > > > >> > > > >> > > Mihael > > > >> > > > >> > > > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong wrote: > > > >> > > > >> > > > I'm running a test Swift/T script that submit tasks to > > > >> Coasters > > > >> > > > >> through > > > >> > > > >> > > the > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour where task > > > >> > > > >> > > > submission/execution is stalling for ~2 minute periods. > > > >> For > > > >> > > > >> example, I'm > > > >> > > > >> > > > seeing submit log messages like "submitting > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: /bin/hostname" in > > > >> bursts of > > > >> > > > >> several > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in between, > > e.g. > > > >> I'm > > > >> > > seeing > > > >> > > > >> > > bursts > > > >> > > > >> > > > with the following intervals in my logs. > > > >> > > > >> > > > > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > >> > > > >> > > > > > > >> > > > >> > > > From what I can tell, the delay is on the coaster > > service > > > >> side: > > > >> > > the > > > >> > > > >> C > > > >> > > > >> > > > client is just waiting for a response. > > > >> > > > >> > > > > > > >> > > > >> > > > The jobs are just being submitted through the local job > > > >> > > manager, so > > > >> > > > >> I > > > >> > > > >> > > > wouldn't expect any delays there. The tasks are also > > just > > > >> > > > >> > > "/bin/hostname", > > > >> > > > >> > > > so should return immediately. > > > >> > > > >> > > > > > > >> > > > >> > > > I'm going to continue digging into this on my own, but > > the > > > >> 2 > > > >> > > minute > > > >> > > > >> delay > > > >> > > > >> > > > seems like a big clue: does anyone have an idea what > > could > > > >> cause > > > >> > > > >> stalls > > > >> > > > >> > > in > > > >> > > > >> > > > task submission of 2 minute duration? > > > >> > > > >> > > > > > > >> > > > >> > > > Cheers, > > > >> > > > >> > > > Tim > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > > > > > > > > From tim.g.armstrong at gmail.com Thu Sep 11 10:30:05 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 10:30:05 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410205101.24345.22.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> Message-ID: This all sounds great. Just to check that I've understood correctly, from the client's point of view: * The per-client settings behave the same if -shared is not provided. 
* Per-client settings are ignored if -shared is provided I had one question: * Do automatically allocated workers work with per-client settings? I understand there were some issues related to sharing workers between clients. Was the solution to have separate worker pools, or is this just not supported? - Tim On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan wrote: > So... > > There were bugs. Lots of bugs. > I did some work over the weekend to fix some of these and clean up the > coaster code. Here's a summary: > > - there was some stuff in the low level coaster code to deal with > persisting coaster channels over multiple connections with various > options, like periodic connections, client or server initiated > connections, buffering of commands, etc. None of this was used by Swift, > and the code was pretty messy. I removed that. > - there were some issues with multiple clients: > * improper shutdown of relevant workers when a client disconnected > * the worker task dispatcher was a singleton and had a reference to > one block allocator, whereas multiple clients involved multiple > allocators. > - there were a bunch of locking issues in the C client that valgrind > caught > - the idea of remote job ids was a bit hard to work with. This remote id > was the job id that the service assigned to a job. This is necessary > because two different clients can submit jobs with the same id. The > remote id would be communicated to the client as the reply to the submit > request. However, it was entirely possible for a notification about job > status to be sent to the client before the submit reply was. Since > notifications were sent using the remote-id, the client would have no > idea what job the notifications belonged to. Now, the server might need > a unique job id, but there is no reason why it cannot use the client id > when communicating the status to a client. So that's there now. > - the way the C client was working, its jobs ended up not going to the > workers, but the local queue. The service settings now allow specifying > the provider/jobManager/url to be used to start blocks, and jobs are > routed appropriately if they do not have the batch job flag set. > > I also added a shared service mode. We discussed this before. Basically > you start the coaster service with "-shared " and > all the settings are read from that file. In this case, all clients > share the same worker pool, and client settings are ignored. > > The C client now has a multi-job testing tool which can submit many jobs > with the desired level of concurrency. > > I have tested the C client with both shared and non-shared mode, with > various levels of jobs being sent, with either one or two concurrent > clients. > > I haven't tested manual workers. > > I've also decided that during normal operation (i.e. client connects, > submits jobs, shuts down gracefully), there should be no exceptions in > the coaster log. I think we should stick to that principle. This was the > case last I tested, and we should consider any deviation from that to be > a problem. Of course, there are some things for which there is no > graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are > fine in that case. > > So anyway, let's start from here. > > Mihael > > On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > Thanks, let me know if there's anything I can help do. > > > > - Tim > > > > > > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan > wrote: > > > > > Thanks. 
It also seems that there is an older bug in there in which the > > > client connection is not properly accounted for and things start > failing > > > two minutes after the client connects (which is also probably why you > > > didn't see this in runs with many short client connections). I'm not > > > sure why the fix for that bug isn't in the trunk code. > > > > > > In any event, I'll set up a client submission loop and fix all these > > > things. > > > > > > Mihael > > > > > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > > > Ok, here it is with the additional debug messages. Source code > change is > > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > > > > > > Warning: the tarball will expand to several gigabytes of logs. > > > > > > > > I had to do multiple client runs to trigger it. It seems like the > > > problem > > > > might be triggered by abnormal termination of the client. First 18 > runs > > > > went fine, problem only started when I ctrl-c-ed the swift/t run #19 > > > before > > > > the run #20 that exhibited delays. > > > > > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > > > > > > - Tim > > > > > > > > > > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > tim.g.armstrong at gmail.com > > > > > > > > wrote: > > > > > > > > > It's here: > > > > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . > > > > > > > > > > I'll add some extra debug messages in the coaster C++ client and > see > > > if I > > > > > can recreate the scenario. > > > > > > > > > > - Tim > > > > > > > > > > > > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > hategan at mcs.anl.gov> > > > > > wrote: > > > > > > > > > >> Ok, so that's legit. > > > > >> > > > > >> It does look like shut down workers are not properly accounted > for in > > > > >> some places (and I believe Yadu submitted a bug for this). > However, I > > > do > > > > >> not see the dead time you mention in either of the last two sets > of > > > > >> logs. It looks like each client instance submits a continous > stream of > > > > >> jobs. > > > > >> > > > > >> So let's get back to the initial log. Can I have the full service > log? > > > > >> I'm trying to track what happened with the jobs submitted before > the > > > > >> first big pause. > > > > >> > > > > >> Also, a log message in CoasterClient::updateJobStatus() (or > friends) > > > > >> would probably help a lot here. > > > > >> > > > > >> Mihael > > > > >> > > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > > > >> > Should be here: > > > > >> > > > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > hategan at mcs.anl.gov > > > > > > > > >> wrote: > > > > >> > > > > > >> > > The first worker "failing" is 0904-20022331. The log looks > funny > > > at > > > > >> the > > > > >> > > end. > > > > >> > > > > > > >> > > Can you git pull and re-run? The worker is getting some > command > > > at the > > > > >> > > end there and doing nothing about it and I wonder why. 
> > > > >> > > > > > > >> > > Mihael > > > > >> > > > > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > > > >> > > > Ok, now I have some worker logs: > > > > >> > > > > > > > >> > > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > >> > > > > > > > >> > > > There's nothing obvious I see in the worker logs that would > > > > >> indicate why > > > > >> > > > the connection was broken. > > > > >> > > > > > > > >> > > > - Tim > > > > >> > > > > > > > >> > > > > > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > > > >> tim.g.armstrong at gmail.com > > > > >> > > > > > > > >> > > > wrote: > > > > >> > > > > > > > >> > > > > This is all running locally on my laptop, so I think we > can > > > rule > > > > >> out > > > > >> > > 1). > > > > >> > > > > > > > > >> > > > > It also seems like it's a state the coaster service gets > into > > > > >> after a > > > > >> > > few > > > > >> > > > > client sessions: generally the first coaster run works > fine, > > > then > > > > >> > > after a > > > > >> > > > > few runs the problem occurs more frequently. > > > > >> > > > > > > > > >> > > > > I'm going to try and get worker logs, in the meantime > i've got > > > > >> some > > > > >> > > > > jstacks (attached). > > > > >> > > > > > > > > >> > > > > Matching service logs (largish) are here if needed: > > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > > > > >> hategan at mcs.anl.gov> > > > > >> > > > > wrote: > > > > >> > > > > > > > > >> > > > >> Ah, makes sense. > > > > >> > > > >> > > > > >> > > > >> 2 minutes is the channel timeout. Each live connection is > > > > >> guaranteed > > > > >> > > to > > > > >> > > > >> have some communication for any 2 minute time window, > > > partially > > > > >> due to > > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no packets > flow > > > > >> for the > > > > >> > > > >> duration of 2 minutes, the connection is assumed broken > and > > > all > > > > >> jobs > > > > >> > > > >> that were submitted to the respective workers are > considered > > > > >> failed. > > > > >> > > So > > > > >> > > > >> there seems to be an issue with the connections to some > of > > > the > > > > >> > > workers, > > > > >> > > > >> and it takes 2 minutes to detect them. > > > > >> > > > >> > > > > >> > > > >> Since the service seems to be alive (although a jstack > on the > > > > >> service > > > > >> > > > >> when thing seem to hang might help), this leaves two > > > > >> possibilities: > > > > >> > > > >> 1 - some genuine network problem > > > > >> > > > >> 2 - the worker died without properly closing TCP > connections > > > > >> > > > >> > > > > >> > > > >> If (2), you could enable worker logging > > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see if > > > > >> anything > > > > >> > > shows > > > > >> > > > >> up. > > > > >> > > > >> > > > > >> > > > >> Mihael > > > > >> > > > >> > > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > > > > >> > > > >> > Here are client and service logs, with part of service > log > > > > >> edited > > > > >> > > down > > > > >> > > > >> to > > > > >> > > > >> > be a reasonable size (I have the full thing if needed, > but > > > it > > > > >> was > > > > >> > > over a > > > > >> > > > >> > gigabyte). > > > > >> > > > >> > > > > > >> > > > >> > One relevant section is from 19:49:35 onwards. 
The > client > > > > >> submits 4 > > > > >> > > > >> jobs > > > > >> > > > >> > (its limit), but they don't complete until 19:51:32 or > so > > > (I > > > > >> can see > > > > >> > > > >> that > > > > >> > > > >> > one task completed based on ncompleted=1 in the > > > check_tasks log > > > > >> > > > >> message). > > > > >> > > > >> > It looks like something has happened with broken pipes > and > > > > >> workers > > > > >> > > being > > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of that > is > > > > >> likely to > > > > >> > > be. > > > > >> > > > >> > > > > > >> > > > >> > - Tim > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > > > > >> hategan at mcs.anl.gov > > > > >> > > > > > > > >> > > > >> wrote: > > > > >> > > > >> > > > > > >> > > > >> > > Hi Tim, > > > > >> > > > >> > > > > > > >> > > > >> > > I've never seen this before with pure Java. > > > > >> > > > >> > > > > > > >> > > > >> > > Do you have logs from these runs? > > > > >> > > > >> > > > > > > >> > > > >> > > Mihael > > > > >> > > > >> > > > > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong > wrote: > > > > >> > > > >> > > > I'm running a test Swift/T script that submit > tasks to > > > > >> Coasters > > > > >> > > > >> through > > > > >> > > > >> > > the > > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour where > task > > > > >> > > > >> > > > submission/execution is stalling for ~2 minute > periods. > > > > >> For > > > > >> > > > >> example, I'm > > > > >> > > > >> > > > seeing submit log messages like "submitting > > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > /bin/hostname" in > > > > >> bursts of > > > > >> > > > >> several > > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in between, > > > e.g. > > > > >> I'm > > > > >> > > seeing > > > > >> > > > >> > > bursts > > > > >> > > > >> > > > with the following intervals in my logs. > > > > >> > > > >> > > > > > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > > >> > > > >> > > > > > > > >> > > > >> > > > From what I can tell, the delay is on the coaster > > > service > > > > >> side: > > > > >> > > the > > > > >> > > > >> C > > > > >> > > > >> > > > client is just waiting for a response. > > > > >> > > > >> > > > > > > > >> > > > >> > > > The jobs are just being submitted through the > local job > > > > >> > > manager, so > > > > >> > > > >> I > > > > >> > > > >> > > > wouldn't expect any delays there. The tasks are > also > > > just > > > > >> > > > >> > > "/bin/hostname", > > > > >> > > > >> > > > so should return immediately. > > > > >> > > > >> > > > > > > > >> > > > >> > > > I'm going to continue digging into this on my own, > but > > > the > > > > >> 2 > > > > >> > > minute > > > > >> > > > >> delay > > > > >> > > > >> > > > seems like a big clue: does anyone have an idea > what > > > could > > > > >> cause > > > > >> > > > >> stalls > > > > >> > > > >> > > in > > > > >> > > > >> > > > task submission of 2 minute duration? 
> > > > >> > > > >> > > > > > > > >> > > > >> > > > Cheers, > > > > >> > > > >> > > > Tim > > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > >> > > > > >> > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tim.g.armstrong at gmail.com Thu Sep 11 12:16:30 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 12:16:30 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> Message-ID: I'm seeing failures when running Swift/T tests with start-coaster-service.sh. E.g. the turbine test coaster-exec-1. I can provide instructions for running the test if needed (roughly, you need to build Swift/T with coaster support enabled, then make tests/coaster-exec-1.result in the turbine directory). The github swift-t release is up to date if you want to use that. Full log is attached, stack trace excerpt is below. - Tim 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... id=0911-1112130 Using threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242 ] 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel Using threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242] org.globus.cog.coaster.channels.ChannelException: Invalid channel: null @id://null-nullS at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) at org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) at org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) provider=local 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler provider=local org.globus.cog.coaster.channels.ChannelException: Invalid channel: null @id://null-nullS at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) at org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) Handler(tag: 38907, SUBMITJOB) 
sending error: Could not deserialize job description org.globus.cog.coaster.ProtocolException: Could not deserialize job description at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid channel: null at id://null-nullS at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) at org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) ... 4 more 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job description org.globus.cog.coaster.ProtocolException: Could not deserialize job description at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid channel: null at id://null-nullS at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) at org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) at org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) ... 4 more On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong wrote: > This all sounds great. > > Just to check that I've understood correctly, from the client's point of > view: > * The per-client settings behave the same if -shared is not provided. > * Per-client settings are ignored if -shared is provided > > I had one question: > * Do automatically allocated workers work with per-client settings? I > understand there were some issues related to sharing workers between > clients. Was the solution to have separate worker pools, or is this just > not supported? > > - Tim > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan > wrote: > >> So... >> >> There were bugs. Lots of bugs. >> I did some work over the weekend to fix some of these and clean up the >> coaster code. Here's a summary: >> >> - there was some stuff in the low level coaster code to deal with >> persisting coaster channels over multiple connections with various >> options, like periodic connections, client or server initiated >> connections, buffering of commands, etc. None of this was used by Swift, >> and the code was pretty messy. I removed that. 
>> - there were some issues with multiple clients: >> * improper shutdown of relevant workers when a client disconnected >> * the worker task dispatcher was a singleton and had a reference to >> one block allocator, whereas multiple clients involved multiple >> allocators. >> - there were a bunch of locking issues in the C client that valgrind >> caught >> - the idea of remote job ids was a bit hard to work with. This remote id >> was the job id that the service assigned to a job. This is necessary >> because two different clients can submit jobs with the same id. The >> remote id would be communicated to the client as the reply to the submit >> request. However, it was entirely possible for a notification about job >> status to be sent to the client before the submit reply was. Since >> notifications were sent using the remote-id, the client would have no >> idea what job the notifications belonged to. Now, the server might need >> a unique job id, but there is no reason why it cannot use the client id >> when communicating the status to a client. So that's there now. >> - the way the C client was working, its jobs ended up not going to the >> workers, but the local queue. The service settings now allow specifying >> the provider/jobManager/url to be used to start blocks, and jobs are >> routed appropriately if they do not have the batch job flag set. >> >> I also added a shared service mode. We discussed this before. Basically >> you start the coaster service with "-shared " and >> all the settings are read from that file. In this case, all clients >> share the same worker pool, and client settings are ignored. >> >> The C client now has a multi-job testing tool which can submit many jobs >> with the desired level of concurrency. >> >> I have tested the C client with both shared and non-shared mode, with >> various levels of jobs being sent, with either one or two concurrent >> clients. >> >> I haven't tested manual workers. >> >> I've also decided that during normal operation (i.e. client connects, >> submits jobs, shuts down gracefully), there should be no exceptions in >> the coaster log. I think we should stick to that principle. This was the >> case last I tested, and we should consider any deviation from that to be >> a problem. Of course, there are some things for which there is no >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are >> fine in that case. >> >> So anyway, let's start from here. >> >> Mihael >> >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: >> > Thanks, let me know if there's anything I can help do. >> > >> > - Tim >> > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan >> wrote: >> > >> > > Thanks. It also seems that there is an older bug in there in which the >> > > client connection is not properly accounted for and things start >> failing >> > > two minutes after the client connects (which is also probably why you >> > > didn't see this in runs with many short client connections). I'm not >> > > sure why the fix for that bug isn't in the trunk code. >> > > >> > > In any event, I'll set up a client submission loop and fix all these >> > > things. >> > > >> > > Mihael >> > > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: >> > > > Ok, here it is with the additional debug messages. Source code >> change is >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. >> > > > >> > > > Warning: the tarball will expand to several gigabytes of logs. >> > > > >> > > > I had to do multiple client runs to trigger it. 
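One of the fixes in the summary quoted above is that the worker task dispatcher used to be a singleton holding a single block allocator, while without -shared each client needs its own allocator and with -shared everyone uses one pool whose settings come from the shared settings file. Purely as a sketch of that bookkeeping, with class and method names invented for illustration rather than taken from the real coaster code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustration only: one allocator per client in the default mode,
    // a single shared allocator when the service runs with -shared.
    public class AllocatorRegistry {
        // Stand-in for a block allocator; not a coaster class.
        public static class BlockAllocator {
            private final String settingsId;
            public BlockAllocator(String settingsId) { this.settingsId = settingsId; }
            public String getSettingsId() { return settingsId; }
        }

        private final boolean shared;
        private final BlockAllocator sharedAllocator;
        private final Map<String, BlockAllocator> perClient = new ConcurrentHashMap<>();

        public AllocatorRegistry(boolean shared, String sharedSettingsFile) {
            this.shared = shared;
            this.sharedAllocator = shared ? new BlockAllocator(sharedSettingsFile) : null;
        }

        // Which allocator a job from this client should be routed to.
        public BlockAllocator allocatorFor(String clientId, String clientSettingsId) {
            if (shared) {
                // -shared: one worker pool, per-client settings ignored
                return sharedAllocator;
            }
            return perClient.computeIfAbsent(clientId, id -> new BlockAllocator(clientSettingsId));
        }

        // Called when a client disconnects so its workers can be shut down.
        public BlockAllocator removeClient(String clientId) {
            return shared ? null : perClient.remove(clientId);
        }
    }

The point of the sketch is the lookup key: per client id in the default mode, a single shared instance in -shared mode, and an explicit removal path so a disconnecting client's workers can be cleaned up.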
It seems like the >> > > problem >> > > > might be triggered by abnormal termination of the client. First 18 >> runs >> > > > went fine, problem only started when I ctrl-c-ed the swift/t run #19 >> > > before >> > > > the run #20 that exhibited delays. >> > > > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz >> > > > >> > > > - Tim >> > > > >> > > > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < >> tim.g.armstrong at gmail.com >> > > > >> > > > wrote: >> > > > >> > > > > It's here: >> > > > > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . >> > > > > >> > > > > I'll add some extra debug messages in the coaster C++ client and >> see >> > > if I >> > > > > can recreate the scenario. >> > > > > >> > > > > - Tim >> > > > > >> > > > > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < >> hategan at mcs.anl.gov> >> > > > > wrote: >> > > > > >> > > > >> Ok, so that's legit. >> > > > >> >> > > > >> It does look like shut down workers are not properly accounted >> for in >> > > > >> some places (and I believe Yadu submitted a bug for this). >> However, I >> > > do >> > > > >> not see the dead time you mention in either of the last two sets >> of >> > > > >> logs. It looks like each client instance submits a continous >> stream of >> > > > >> jobs. >> > > > >> >> > > > >> So let's get back to the initial log. Can I have the full >> service log? >> > > > >> I'm trying to track what happened with the jobs submitted before >> the >> > > > >> first big pause. >> > > > >> >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or >> friends) >> > > > >> would probably help a lot here. >> > > > >> >> > > > >> Mihael >> > > > >> >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: >> > > > >> > Should be here: >> > > > >> > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz >> > > > >> > >> > > > >> > >> > > > >> > >> > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < >> hategan at mcs.anl.gov >> > > > >> > > > >> wrote: >> > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. The log looks >> funny >> > > at >> > > > >> the >> > > > >> > > end. >> > > > >> > > >> > > > >> > > Can you git pull and re-run? The worker is getting some >> command >> > > at the >> > > > >> > > end there and doing nothing about it and I wonder why. >> > > > >> > > >> > > > >> > > Mihael >> > > > >> > > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: >> > > > >> > > > Ok, now I have some worker logs: >> > > > >> > > > >> > > > >> > > > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz >> > > > >> > > > >> > > > >> > > > There's nothing obvious I see in the worker logs that would >> > > > >> indicate why >> > > > >> > > > the connection was broken. >> > > > >> > > > >> > > > >> > > > - Tim >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < >> > > > >> tim.g.armstrong at gmail.com >> > > > >> > > > >> > > > >> > > > wrote: >> > > > >> > > > >> > > > >> > > > > This is all running locally on my laptop, so I think we >> can >> > > rule >> > > > >> out >> > > > >> > > 1). 
>> > > > >> > > > > >> > > > >> > > > > It also seems like it's a state the coaster service gets >> into >> > > > >> after a >> > > > >> > > few >> > > > >> > > > > client sessions: generally the first coaster run works >> fine, >> > > then >> > > > >> > > after a >> > > > >> > > > > few runs the problem occurs more frequently. >> > > > >> > > > > >> > > > >> > > > > I'm going to try and get worker logs, in the meantime >> i've got >> > > > >> some >> > > > >> > > > > jstacks (attached). >> > > > >> > > > > >> > > > >> > > > > Matching service logs (largish) are here if needed: >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < >> > > > >> hategan at mcs.anl.gov> >> > > > >> > > > > wrote: >> > > > >> > > > > >> > > > >> > > > >> Ah, makes sense. >> > > > >> > > > >> >> > > > >> > > > >> 2 minutes is the channel timeout. Each live connection >> is >> > > > >> guaranteed >> > > > >> > > to >> > > > >> > > > >> have some communication for any 2 minute time window, >> > > partially >> > > > >> due to >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no >> packets flow >> > > > >> for the >> > > > >> > > > >> duration of 2 minutes, the connection is assumed broken >> and >> > > all >> > > > >> jobs >> > > > >> > > > >> that were submitted to the respective workers are >> considered >> > > > >> failed. >> > > > >> > > So >> > > > >> > > > >> there seems to be an issue with the connections to some >> of >> > > the >> > > > >> > > workers, >> > > > >> > > > >> and it takes 2 minutes to detect them. >> > > > >> > > > >> >> > > > >> > > > >> Since the service seems to be alive (although a jstack >> on the >> > > > >> service >> > > > >> > > > >> when thing seem to hang might help), this leaves two >> > > > >> possibilities: >> > > > >> > > > >> 1 - some genuine network problem >> > > > >> > > > >> 2 - the worker died without properly closing TCP >> connections >> > > > >> > > > >> >> > > > >> > > > >> If (2), you could enable worker logging >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see >> if >> > > > >> anything >> > > > >> > > shows >> > > > >> > > > >> up. >> > > > >> > > > >> >> > > > >> > > > >> Mihael >> > > > >> > > > >> >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: >> > > > >> > > > >> > Here are client and service logs, with part of >> service log >> > > > >> edited >> > > > >> > > down >> > > > >> > > > >> to >> > > > >> > > > >> > be a reasonable size (I have the full thing if >> needed, but >> > > it >> > > > >> was >> > > > >> > > over a >> > > > >> > > > >> > gigabyte). >> > > > >> > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. The >> client >> > > > >> submits 4 >> > > > >> > > > >> jobs >> > > > >> > > > >> > (its limit), but they don't complete until 19:51:32 >> or so >> > > (I >> > > > >> can see >> > > > >> > > > >> that >> > > > >> > > > >> > one task completed based on ncompleted=1 in the >> > > check_tasks log >> > > > >> > > > >> message). >> > > > >> > > > >> > It looks like something has happened with broken >> pipes and >> > > > >> workers >> > > > >> > > being >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of >> that is >> > > > >> likely to >> > > > >> > > be. 
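To make the timeout arithmetic quoted above concrete: heartbeats go out every minute, and a connection with no traffic for two minutes is assumed broken, at which point jobs on the corresponding workers are failed. The following is a schematic of that liveness rule only, not the actual coaster channel code:

    import java.util.concurrent.TimeUnit;

    // Schematic of the rule described above: heartbeat every minute,
    // declare the connection broken after two minutes of silence.
    public class ChannelLiveness {
        private static final long HEARTBEAT_INTERVAL_MS = TimeUnit.MINUTES.toMillis(1);
        private static final long CHANNEL_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(2);

        private volatile long lastTrafficTime = System.currentTimeMillis();

        // Any packet, including a heartbeat, refreshes the timestamp.
        public void packetReceived() {
            lastTrafficTime = System.currentTimeMillis();
        }

        public boolean heartbeatDue(long now) {
            return now - lastTrafficTime >= HEARTBEAT_INTERVAL_MS;
        }

        // If nothing has flowed for the full timeout, the connection is
        // assumed broken and jobs on the corresponding workers are failed.
        public boolean isBroken(long now) {
            return now - lastTrafficTime >= CHANNEL_TIMEOUT_MS;
        }
    }

This is also why a worker that dies without closing its TCP connection shows up as a roughly two-minute stall: nothing is marked failed until the timeout check trips.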
>> > > > >> > > > >> > >> > > > >> > > > >> > - Tim >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < >> > > > >> hategan at mcs.anl.gov >> > > > >> > > > >> > > > >> > > > >> wrote: >> > > > >> > > > >> > >> > > > >> > > > >> > > Hi Tim, >> > > > >> > > > >> > > >> > > > >> > > > >> > > I've never seen this before with pure Java. >> > > > >> > > > >> > > >> > > > >> > > > >> > > Do you have logs from these runs? >> > > > >> > > > >> > > >> > > > >> > > > >> > > Mihael >> > > > >> > > > >> > > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong >> wrote: >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit >> tasks to >> > > > >> Coasters >> > > > >> > > > >> through >> > > > >> > > > >> > > the >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour >> where task >> > > > >> > > > >> > > > submission/execution is stalling for ~2 minute >> periods. >> > > > >> For >> > > > >> > > > >> example, I'm >> > > > >> > > > >> > > > seeing submit log messages like "submitting >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: >> /bin/hostname" in >> > > > >> bursts of >> > > > >> > > > >> several >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in >> between, >> > > e.g. >> > > > >> I'm >> > > > >> > > seeing >> > > > >> > > > >> > > bursts >> > > > >> > > > >> > > > with the following intervals in my logs. >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > From what I can tell, the delay is on the coaster >> > > service >> > > > >> side: >> > > > >> > > the >> > > > >> > > > >> C >> > > > >> > > > >> > > > client is just waiting for a response. >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > The jobs are just being submitted through the >> local job >> > > > >> > > manager, so >> > > > >> > > > >> I >> > > > >> > > > >> > > > wouldn't expect any delays there. The tasks are >> also >> > > just >> > > > >> > > > >> > > "/bin/hostname", >> > > > >> > > > >> > > > so should return immediately. >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > I'm going to continue digging into this on my >> own, but >> > > the >> > > > >> 2 >> > > > >> > > minute >> > > > >> > > > >> delay >> > > > >> > > > >> > > > seems like a big clue: does anyone have an idea >> what >> > > could >> > > > >> cause >> > > > >> > > > >> stalls >> > > > >> > > > >> > > in >> > > > >> > > > >> > > > task submission of 2 minute duration? >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > Cheers, >> > > > >> > > > >> > > > Tim >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > >> > > > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> >> > > > >> >> > > > >> >> > > > > >> > > >> > > >> > > >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: start-coaster-service.log.gz Type: application/x-gzip Size: 16371 bytes Desc: not available URL: From hategan at mcs.anl.gov Thu Sep 11 12:37:32 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 10:37:32 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> Message-ID: <1410457052.25856.3.camel@echo> On Thu, 2014-09-11 at 10:30 -0500, Tim Armstrong wrote: > This all sounds great. > > Just to check that I've understood correctly, from the client's point of > view: > * The per-client settings behave the same if -shared is not provided. Yes. > * Per-client settings are ignored if -shared is provided Yes. You need to send the init command though to get a config id. > > I had one question: > * Do automatically allocated workers work with per-client settings? It was supposed to work before and it is now (according to my testing). > I understand there were some issues related to sharing workers between > clients. Was the solution to have separate worker pools, or is this just > not supported? With -shared there is one worker pool and one set of settings. Without -shared, each client gets a worker pool (with its own settings). The issue wasn't conceptual. Just poor code-writing on my part. Mihael From hategan at mcs.anl.gov Thu Sep 11 12:39:33 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 10:39:33 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> Message-ID: <1410457173.25856.5.camel@echo> The method "getMetaChannel()" has been removed. Where did you get the code from? Mihael On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > I'm seeing failures when running Swift/T tests with > start-coaster-service.sh. > > E.g. the turbine test coaster-exec-1. I can provide instructions for > running the test if needed (roughly, you need to build Swift/T with coaster > support enabled, then make tests/coaster-exec-1.result in the turbine > directory). The github swift-t release is up to date if you want to use > that. > > Full log is attached, stack trace excerpt is below. > > - Tim > > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
> id=0911-1112130 > Using threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242 > ] > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel Using > threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242] > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > @id://null-nullS > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > at > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > at > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > at > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > at > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > at > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > at > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > provider=local > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler provider=local > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > @id://null-nullS > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > at > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > at > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > at > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > at > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job > description > org.globus.cog.coaster.ProtocolException: Could not deserialize job > description > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > at > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > at > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > at > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > channel: null at id://null-nullS > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > at > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > ... 
4 more > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, > SUBMITJOB) sending error: Could not deserialize job description > org.globus.cog.coaster.ProtocolException: Could not deserialize job > description > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > at > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > at > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > at > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > channel: null at id://null-nullS > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > at > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > at > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > ... 4 more > > > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong > wrote: > > > This all sounds great. > > > > Just to check that I've understood correctly, from the client's point of > > view: > > * The per-client settings behave the same if -shared is not provided. > > * Per-client settings are ignored if -shared is provided > > > > I had one question: > > * Do automatically allocated workers work with per-client settings? I > > understand there were some issues related to sharing workers between > > clients. Was the solution to have separate worker pools, or is this just > > not supported? > > > > - Tim > > > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan > > wrote: > > > >> So... > >> > >> There were bugs. Lots of bugs. > >> I did some work over the weekend to fix some of these and clean up the > >> coaster code. Here's a summary: > >> > >> - there was some stuff in the low level coaster code to deal with > >> persisting coaster channels over multiple connections with various > >> options, like periodic connections, client or server initiated > >> connections, buffering of commands, etc. None of this was used by Swift, > >> and the code was pretty messy. I removed that. > >> - there were some issues with multiple clients: > >> * improper shutdown of relevant workers when a client disconnected > >> * the worker task dispatcher was a singleton and had a reference to > >> one block allocator, whereas multiple clients involved multiple > >> allocators. > >> - there were a bunch of locking issues in the C client that valgrind > >> caught > >> - the idea of remote job ids was a bit hard to work with. This remote id > >> was the job id that the service assigned to a job. This is necessary > >> because two different clients can submit jobs with the same id. The > >> remote id would be communicated to the client as the reply to the submit > >> request. However, it was entirely possible for a notification about job > >> status to be sent to the client before the submit reply was. Since > >> notifications were sent using the remote-id, the client would have no > >> idea what job the notifications belonged to. Now, the server might need > >> a unique job id, but there is no reason why it cannot use the client id > >> when communicating the status to a client. So that's there now. 
> >> - the way the C client was working, its jobs ended up not going to the > >> workers, but the local queue. The service settings now allow specifying > >> the provider/jobManager/url to be used to start blocks, and jobs are > >> routed appropriately if they do not have the batch job flag set. > >> > >> I also added a shared service mode. We discussed this before. Basically > >> you start the coaster service with "-shared " and > >> all the settings are read from that file. In this case, all clients > >> share the same worker pool, and client settings are ignored. > >> > >> The C client now has a multi-job testing tool which can submit many jobs > >> with the desired level of concurrency. > >> > >> I have tested the C client with both shared and non-shared mode, with > >> various levels of jobs being sent, with either one or two concurrent > >> clients. > >> > >> I haven't tested manual workers. > >> > >> I've also decided that during normal operation (i.e. client connects, > >> submits jobs, shuts down gracefully), there should be no exceptions in > >> the coaster log. I think we should stick to that principle. This was the > >> case last I tested, and we should consider any deviation from that to be > >> a problem. Of course, there are some things for which there is no > >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are > >> fine in that case. > >> > >> So anyway, let's start from here. > >> > >> Mihael > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > >> > Thanks, let me know if there's anything I can help do. > >> > > >> > - Tim > >> > > >> > > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan > >> wrote: > >> > > >> > > Thanks. It also seems that there is an older bug in there in which the > >> > > client connection is not properly accounted for and things start > >> failing > >> > > two minutes after the client connects (which is also probably why you > >> > > didn't see this in runs with many short client connections). I'm not > >> > > sure why the fix for that bug isn't in the trunk code. > >> > > > >> > > In any event, I'll set up a client submission loop and fix all these > >> > > things. > >> > > > >> > > Mihael > >> > > > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > >> > > > Ok, here it is with the additional debug messages. Source code > >> change is > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > >> > > > > >> > > > Warning: the tarball will expand to several gigabytes of logs. > >> > > > > >> > > > I had to do multiple client runs to trigger it. It seems like the > >> > > problem > >> > > > might be triggered by abnormal termination of the client. First 18 > >> runs > >> > > > went fine, problem only started when I ctrl-c-ed the swift/t run #19 > >> > > before > >> > > > the run #20 that exhibited delays. > >> > > > > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > >> > > > > >> > > > - Tim > >> > > > > >> > > > > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > >> tim.g.armstrong at gmail.com > >> > > > > >> > > > wrote: > >> > > > > >> > > > > It's here: > >> > > > > > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz . > >> > > > > > >> > > > > I'll add some extra debug messages in the coaster C++ client and > >> see > >> > > if I > >> > > > > can recreate the scenario. 
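The remote-id change quoted just above boils down to keeping the service's unique internal job id separate from the id the client submitted with, and always notifying the client under its own id. A sketch of that bookkeeping, with names invented here rather than taken from the real coaster handler classes:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of the id mapping described above: the service needs a unique
    // internal id (two clients may reuse the same job id), but status
    // notifications go back out under the client's own id.
    public class JobIdMap {
        private final AtomicLong nextInternalId = new AtomicLong();
        // internal id -> {clientId, client's own job id}
        private final Map<Long, String[]> internalToClient = new ConcurrentHashMap<>();

        public long register(String clientId, String clientJobId) {
            long internal = nextInternalId.incrementAndGet();
            internalToClient.put(internal, new String[] {clientId, clientJobId});
            return internal;
        }

        // On a state change, look up which client to notify and use the id
        // that client originally supplied, not the internal one.
        public void notifyStatus(long internalId, String status) {
            String[] entry = internalToClient.get(internalId);
            if (entry != null) {
                System.out.printf("notify client %s: job %s -> %s%n", entry[0], entry[1], status);
            }
        }
    }

With this arrangement a status notification that races ahead of the submit reply is still meaningful to the client, since it is keyed by an id the client already knows.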
> >> > > > > > >> > > > > - Tim > >> > > > > > >> > > > > > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > >> hategan at mcs.anl.gov> > >> > > > > wrote: > >> > > > > > >> > > > >> Ok, so that's legit. > >> > > > >> > >> > > > >> It does look like shut down workers are not properly accounted > >> for in > >> > > > >> some places (and I believe Yadu submitted a bug for this). > >> However, I > >> > > do > >> > > > >> not see the dead time you mention in either of the last two sets > >> of > >> > > > >> logs. It looks like each client instance submits a continous > >> stream of > >> > > > >> jobs. > >> > > > >> > >> > > > >> So let's get back to the initial log. Can I have the full > >> service log? > >> > > > >> I'm trying to track what happened with the jobs submitted before > >> the > >> > > > >> first big pause. > >> > > > >> > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or > >> friends) > >> > > > >> would probably help a lot here. > >> > > > >> > >> > > > >> Mihael > >> > > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > >> > > > >> > Should be here: > >> > > > >> > > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > >> hategan at mcs.anl.gov > >> > > > > >> > > > >> wrote: > >> > > > >> > > >> > > > >> > > The first worker "failing" is 0904-20022331. The log looks > >> funny > >> > > at > >> > > > >> the > >> > > > >> > > end. > >> > > > >> > > > >> > > > >> > > Can you git pull and re-run? The worker is getting some > >> command > >> > > at the > >> > > > >> > > end there and doing nothing about it and I wonder why. > >> > > > >> > > > >> > > > >> > > Mihael > >> > > > >> > > > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > >> > > > >> > > > Ok, now I have some worker logs: > >> > > > >> > > > > >> > > > >> > > > > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > >> > > > >> > > > > >> > > > >> > > > There's nothing obvious I see in the worker logs that would > >> > > > >> indicate why > >> > > > >> > > > the connection was broken. > >> > > > >> > > > > >> > > > >> > > > - Tim > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > >> > > > >> tim.g.armstrong at gmail.com > >> > > > >> > > > > >> > > > >> > > > wrote: > >> > > > >> > > > > >> > > > >> > > > > This is all running locally on my laptop, so I think we > >> can > >> > > rule > >> > > > >> out > >> > > > >> > > 1). > >> > > > >> > > > > > >> > > > >> > > > > It also seems like it's a state the coaster service gets > >> into > >> > > > >> after a > >> > > > >> > > few > >> > > > >> > > > > client sessions: generally the first coaster run works > >> fine, > >> > > then > >> > > > >> > > after a > >> > > > >> > > > > few runs the problem occurs more frequently. > >> > > > >> > > > > > >> > > > >> > > > > I'm going to try and get worker logs, in the meantime > >> i've got > >> > > > >> some > >> > > > >> > > > > jstacks (attached). 
> >> > > > >> > > > > > >> > > > >> > > > > Matching service logs (largish) are here if needed: > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > >> > > > >> hategan at mcs.anl.gov> > >> > > > >> > > > > wrote: > >> > > > >> > > > > > >> > > > >> > > > >> Ah, makes sense. > >> > > > >> > > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live connection > >> is > >> > > > >> guaranteed > >> > > > >> > > to > >> > > > >> > > > >> have some communication for any 2 minute time window, > >> > > partially > >> > > > >> due to > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no > >> packets flow > >> > > > >> for the > >> > > > >> > > > >> duration of 2 minutes, the connection is assumed broken > >> and > >> > > all > >> > > > >> jobs > >> > > > >> > > > >> that were submitted to the respective workers are > >> considered > >> > > > >> failed. > >> > > > >> > > So > >> > > > >> > > > >> there seems to be an issue with the connections to some > >> of > >> > > the > >> > > > >> > > workers, > >> > > > >> > > > >> and it takes 2 minutes to detect them. > >> > > > >> > > > >> > >> > > > >> > > > >> Since the service seems to be alive (although a jstack > >> on the > >> > > > >> service > >> > > > >> > > > >> when thing seem to hang might help), this leaves two > >> > > > >> possibilities: > >> > > > >> > > > >> 1 - some genuine network problem > >> > > > >> > > > >> 2 - the worker died without properly closing TCP > >> connections > >> > > > >> > > > >> > >> > > > >> > > > >> If (2), you could enable worker logging > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to see > >> if > >> > > > >> anything > >> > > > >> > > shows > >> > > > >> > > > >> up. > >> > > > >> > > > >> > >> > > > >> > > > >> Mihael > >> > > > >> > > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong wrote: > >> > > > >> > > > >> > Here are client and service logs, with part of > >> service log > >> > > > >> edited > >> > > > >> > > down > >> > > > >> > > > >> to > >> > > > >> > > > >> > be a reasonable size (I have the full thing if > >> needed, but > >> > > it > >> > > > >> was > >> > > > >> > > over a > >> > > > >> > > > >> > gigabyte). > >> > > > >> > > > >> > > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. The > >> client > >> > > > >> submits 4 > >> > > > >> > > > >> jobs > >> > > > >> > > > >> > (its limit), but they don't complete until 19:51:32 > >> or so > >> > > (I > >> > > > >> can see > >> > > > >> > > > >> that > >> > > > >> > > > >> > one task completed based on ncompleted=1 in the > >> > > check_tasks log > >> > > > >> > > > >> message). > >> > > > >> > > > >> > It looks like something has happened with broken > >> pipes and > >> > > > >> workers > >> > > > >> > > being > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of > >> that is > >> > > > >> likely to > >> > > > >> > > be. > >> > > > >> > > > >> > > >> > > > >> > > > >> > - Tim > >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > >> > > > >> hategan at mcs.anl.gov > >> > > > >> > > > > >> > > > >> > > > >> wrote: > >> > > > >> > > > >> > > >> > > > >> > > > >> > > Hi Tim, > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > I've never seen this before with pure Java. 
> >> > > > >> > > > >> > > > >> > > > >> > > > >> > > Do you have logs from these runs? > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > Mihael > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong > >> wrote: > >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit > >> tasks to > >> > > > >> Coasters > >> > > > >> > > > >> through > >> > > > >> > > > >> > > the > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour > >> where task > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 minute > >> periods. > >> > > > >> For > >> > > > >> > > > >> example, I'm > >> > > > >> > > > >> > > > seeing submit log messages like "submitting > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > >> /bin/hostname" in > >> > > > >> bursts of > >> > > > >> > > > >> several > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in > >> between, > >> > > e.g. > >> > > > >> I'm > >> > > > >> > > seeing > >> > > > >> > > > >> > > bursts > >> > > > >> > > > >> > > > with the following intervals in my logs. > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > From what I can tell, the delay is on the coaster > >> > > service > >> > > > >> side: > >> > > > >> > > the > >> > > > >> > > > >> C > >> > > > >> > > > >> > > > client is just waiting for a response. > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > The jobs are just being submitted through the > >> local job > >> > > > >> > > manager, so > >> > > > >> > > > >> I > >> > > > >> > > > >> > > > wouldn't expect any delays there. The tasks are > >> also > >> > > just > >> > > > >> > > > >> > > "/bin/hostname", > >> > > > >> > > > >> > > > so should return immediately. > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > I'm going to continue digging into this on my > >> own, but > >> > > the > >> > > > >> 2 > >> > > > >> > > minute > >> > > > >> > > > >> delay > >> > > > >> > > > >> > > > seems like a big clue: does anyone have an idea > >> what > >> > > could > >> > > > >> cause > >> > > > >> > > > >> stalls > >> > > > >> > > > >> > > in > >> > > > >> > > > >> > > > task submission of 2 minute duration? 
> >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > Cheers, > >> > > > >> > > > >> > > > Tim > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > >> > > > >> > >> > > > >> > >> > > > > > >> > > > >> > > > >> > > > >> > >> > >> > > From tim.g.armstrong at gmail.com Thu Sep 11 12:41:17 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 12:41:17 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410457173.25856.5.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> Message-ID: I thought I was running the latest trunk, I'll rebuild and see if I can reproduce the issue. - Tim On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan wrote: > The method "getMetaChannel()" has been removed. Where did you get the > code from? > > Mihael > > On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > I'm seeing failures when running Swift/T tests with > > start-coaster-service.sh. > > > > E.g. the turbine test coaster-exec-1. I can provide instructions for > > running the test if needed (roughly, you need to build Swift/T with > coaster > > support enabled, then make tests/coaster-exec-1.result in the turbine > > directory). The github swift-t release is up to date if you want to use > > that. > > > > Full log is attached, stack trace excerpt is below. > > > > - Tim > > > > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
> > id=0911-1112130 > > Using threaded sender for TCPChannel [type: server, contact: > 127.0.0.1:48242 > > ] > > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel Using > > threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242] > > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > > @id://null-nullS > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > at > > > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > at > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > at > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > at > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > at > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > at > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > provider=local > > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler provider=local > > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > > @id://null-nullS > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > at > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > at > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > at > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > at > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > at > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > at > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job > > description > > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > description > > at > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > at > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > at > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > at > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > at > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > channel: null at id://null-nullS > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > at > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > at > > > 
org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > ... 4 more > > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, > > SUBMITJOB) sending error: Could not deserialize job description > > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > description > > at > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > at > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > at > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > at > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > at > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > channel: null at id://null-nullS > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > at > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > at > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > at > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > ... 4 more > > > > > > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > tim.g.armstrong at gmail.com> > > wrote: > > > > > This all sounds great. > > > > > > Just to check that I've understood correctly, from the client's point > of > > > view: > > > * The per-client settings behave the same if -shared is not provided. > > > * Per-client settings are ignored if -shared is provided > > > > > > I had one question: > > > * Do automatically allocated workers work with per-client settings? I > > > understand there were some issues related to sharing workers between > > > clients. Was the solution to have separate worker pools, or is this > just > > > not supported? > > > > > > - Tim > > > > > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan > > > wrote: > > > > > >> So... > > >> > > >> There were bugs. Lots of bugs. > > >> I did some work over the weekend to fix some of these and clean up the > > >> coaster code. Here's a summary: > > >> > > >> - there was some stuff in the low level coaster code to deal with > > >> persisting coaster channels over multiple connections with various > > >> options, like periodic connections, client or server initiated > > >> connections, buffering of commands, etc. None of this was used by > Swift, > > >> and the code was pretty messy. I removed that. > > >> - there were some issues with multiple clients: > > >> * improper shutdown of relevant workers when a client disconnected > > >> * the worker task dispatcher was a singleton and had a reference to > > >> one block allocator, whereas multiple clients involved multiple > > >> allocators. > > >> - there were a bunch of locking issues in the C client that valgrind > > >> caught > > >> - the idea of remote job ids was a bit hard to work with. This remote > id > > >> was the job id that the service assigned to a job. This is necessary > > >> because two different clients can submit jobs with the same id. The > > >> remote id would be communicated to the client as the reply to the > submit > > >> request. However, it was entirely possible for a notification about > job > > >> status to be sent to the client before the submit reply was. 
Since > > >> notifications were sent using the remote-id, the client would have no > > >> idea what job the notifications belonged to. Now, the server might > need > > >> a unique job id, but there is no reason why it cannot use the client > id > > >> when communicating the status to a client. So that's there now. > > >> - the way the C client was working, its jobs ended up not going to the > > >> workers, but the local queue. The service settings now allow > specifying > > >> the provider/jobManager/url to be used to start blocks, and jobs are > > >> routed appropriately if they do not have the batch job flag set. > > >> > > >> I also added a shared service mode. We discussed this before. > Basically > > >> you start the coaster service with "-shared " and > > >> all the settings are read from that file. In this case, all clients > > >> share the same worker pool, and client settings are ignored. > > >> > > >> The C client now has a multi-job testing tool which can submit many > jobs > > >> with the desired level of concurrency. > > >> > > >> I have tested the C client with both shared and non-shared mode, with > > >> various levels of jobs being sent, with either one or two concurrent > > >> clients. > > >> > > >> I haven't tested manual workers. > > >> > > >> I've also decided that during normal operation (i.e. client connects, > > >> submits jobs, shuts down gracefully), there should be no exceptions in > > >> the coaster log. I think we should stick to that principle. This was > the > > >> case last I tested, and we should consider any deviation from that to > be > > >> a problem. Of course, there are some things for which there is no > > >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are > > >> fine in that case. > > >> > > >> So anyway, let's start from here. > > >> > > >> Mihael > > >> > > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > >> > Thanks, let me know if there's anything I can help do. > > >> > > > >> > - Tim > > >> > > > >> > > > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > hategan at mcs.anl.gov> > > >> wrote: > > >> > > > >> > > Thanks. It also seems that there is an older bug in there in > which the > > >> > > client connection is not properly accounted for and things start > > >> failing > > >> > > two minutes after the client connects (which is also probably why > you > > >> > > didn't see this in runs with many short client connections). I'm > not > > >> > > sure why the fix for that bug isn't in the trunk code. > > >> > > > > >> > > In any event, I'll set up a client submission loop and fix all > these > > >> > > things. > > >> > > > > >> > > Mihael > > >> > > > > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > >> > > > Ok, here it is with the additional debug messages. Source code > > >> change is > > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > >> > > > > > >> > > > Warning: the tarball will expand to several gigabytes of logs. > > >> > > > > > >> > > > I had to do multiple client runs to trigger it. It seems like > the > > >> > > problem > > >> > > > might be triggered by abnormal termination of the client. > First 18 > > >> runs > > >> > > > went fine, problem only started when I ctrl-c-ed the swift/t > run #19 > > >> > > before > > >> > > > the run #20 that exhibited delays. 
> > >> > > > > > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > >> > > > > > >> > > > - Tim > > >> > > > > > >> > > > > > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > >> tim.g.armstrong at gmail.com > > >> > > > > > >> > > > wrote: > > >> > > > > > >> > > > > It's here: > > >> > > > > > > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > . > > >> > > > > > > >> > > > > I'll add some extra debug messages in the coaster C++ client > and > > >> see > > >> > > if I > > >> > > > > can recreate the scenario. > > >> > > > > > > >> > > > > - Tim > > >> > > > > > > >> > > > > > > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > >> hategan at mcs.anl.gov> > > >> > > > > wrote: > > >> > > > > > > >> > > > >> Ok, so that's legit. > > >> > > > >> > > >> > > > >> It does look like shut down workers are not properly > accounted > > >> for in > > >> > > > >> some places (and I believe Yadu submitted a bug for this). > > >> However, I > > >> > > do > > >> > > > >> not see the dead time you mention in either of the last two > sets > > >> of > > >> > > > >> logs. It looks like each client instance submits a continous > > >> stream of > > >> > > > >> jobs. > > >> > > > >> > > >> > > > >> So let's get back to the initial log. Can I have the full > > >> service log? > > >> > > > >> I'm trying to track what happened with the jobs submitted > before > > >> the > > >> > > > >> first big pause. > > >> > > > >> > > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or > > >> friends) > > >> > > > >> would probably help a lot here. > > >> > > > >> > > >> > > > >> Mihael > > >> > > > >> > > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > >> > > > >> > Should be here: > > >> > > > >> > > > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > > >> hategan at mcs.anl.gov > > >> > > > > > >> > > > >> wrote: > > >> > > > >> > > > >> > > > >> > > The first worker "failing" is 0904-20022331. The log > looks > > >> funny > > >> > > at > > >> > > > >> the > > >> > > > >> > > end. > > >> > > > >> > > > > >> > > > >> > > Can you git pull and re-run? The worker is getting some > > >> command > > >> > > at the > > >> > > > >> > > end there and doing nothing about it and I wonder why. > > >> > > > >> > > > > >> > > > >> > > Mihael > > >> > > > >> > > > > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > >> > > > >> > > > Ok, now I have some worker logs: > > >> > > > >> > > > > > >> > > > >> > > > > > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > >> > > > >> > > > > > >> > > > >> > > > There's nothing obvious I see in the worker logs that > would > > >> > > > >> indicate why > > >> > > > >> > > > the connection was broken. > > >> > > > >> > > > > > >> > > > >> > > > - Tim > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > >> > > > >> tim.g.armstrong at gmail.com > > >> > > > >> > > > > > >> > > > >> > > > wrote: > > >> > > > >> > > > > > >> > > > >> > > > > This is all running locally on my laptop, so I think > we > > >> can > > >> > > rule > > >> > > > >> out > > >> > > > >> > > 1). 
> > >> > > > >> > > > > > > >> > > > >> > > > > It also seems like it's a state the coaster service > gets > > >> into > > >> > > > >> after a > > >> > > > >> > > few > > >> > > > >> > > > > client sessions: generally the first coaster run > works > > >> fine, > > >> > > then > > >> > > > >> > > after a > > >> > > > >> > > > > few runs the problem occurs more frequently. > > >> > > > >> > > > > > > >> > > > >> > > > > I'm going to try and get worker logs, in the meantime > > >> i've got > > >> > > > >> some > > >> > > > >> > > > > jstacks (attached). > > >> > > > >> > > > > > > >> > > > >> > > > > Matching service logs (largish) are here if needed: > > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > > >> > > > >> hategan at mcs.anl.gov> > > >> > > > >> > > > > wrote: > > >> > > > >> > > > > > > >> > > > >> > > > >> Ah, makes sense. > > >> > > > >> > > > >> > > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live > connection > > >> is > > >> > > > >> guaranteed > > >> > > > >> > > to > > >> > > > >> > > > >> have some communication for any 2 minute time > window, > > >> > > partially > > >> > > > >> due to > > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no > > >> packets flow > > >> > > > >> for the > > >> > > > >> > > > >> duration of 2 minutes, the connection is assumed > broken > > >> and > > >> > > all > > >> > > > >> jobs > > >> > > > >> > > > >> that were submitted to the respective workers are > > >> considered > > >> > > > >> failed. > > >> > > > >> > > So > > >> > > > >> > > > >> there seems to be an issue with the connections to > some > > >> of > > >> > > the > > >> > > > >> > > workers, > > >> > > > >> > > > >> and it takes 2 minutes to detect them. > > >> > > > >> > > > >> > > >> > > > >> > > > >> Since the service seems to be alive (although a > jstack > > >> on the > > >> > > > >> service > > >> > > > >> > > > >> when thing seem to hang might help), this leaves two > > >> > > > >> possibilities: > > >> > > > >> > > > >> 1 - some genuine network problem > > >> > > > >> > > > >> 2 - the worker died without properly closing TCP > > >> connections > > >> > > > >> > > > >> > > >> > > > >> > > > >> If (2), you could enable worker logging > > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to > see > > >> if > > >> > > > >> anything > > >> > > > >> > > shows > > >> > > > >> > > > >> up. > > >> > > > >> > > > >> > > >> > > > >> > > > >> Mihael > > >> > > > >> > > > >> > > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong > wrote: > > >> > > > >> > > > >> > Here are client and service logs, with part of > > >> service log > > >> > > > >> edited > > >> > > > >> > > down > > >> > > > >> > > > >> to > > >> > > > >> > > > >> > be a reasonable size (I have the full thing if > > >> needed, but > > >> > > it > > >> > > > >> was > > >> > > > >> > > over a > > >> > > > >> > > > >> > gigabyte). > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. > The > > >> client > > >> > > > >> submits 4 > > >> > > > >> > > > >> jobs > > >> > > > >> > > > >> > (its limit), but they don't complete until > 19:51:32 > > >> or so > > >> > > (I > > >> > > > >> can see > > >> > > > >> > > > >> that > > >> > > > >> > > > >> > one task completed based on ncompleted=1 in the > > >> > > check_tasks log > > >> > > > >> > > > >> message). 
> > >> > > > >> > > > >> > It looks like something has happened with broken > > >> pipes and > > >> > > > >> workers > > >> > > > >> > > being > > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of > > >> that is > > >> > > > >> likely to > > >> > > > >> > > be. > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > - Tim > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > > >> > > > >> hategan at mcs.anl.gov > > >> > > > >> > > > > > >> > > > >> > > > >> wrote: > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > Hi Tim, > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > I've never seen this before with pure Java. > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > Do you have logs from these runs? > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > Mihael > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong > > >> wrote: > > >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit > > >> tasks to > > >> > > > >> Coasters > > >> > > > >> > > > >> through > > >> > > > >> > > > >> > > the > > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour > > >> where task > > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 minute > > >> periods. > > >> > > > >> For > > >> > > > >> > > > >> example, I'm > > >> > > > >> > > > >> > > > seeing submit log messages like "submitting > > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > > >> /bin/hostname" in > > >> > > > >> bursts of > > >> > > > >> > > > >> several > > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in > > >> between, > > >> > > e.g. > > >> > > > >> I'm > > >> > > > >> > > seeing > > >> > > > >> > > > >> > > bursts > > >> > > > >> > > > >> > > > with the following intervals in my logs. > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > From what I can tell, the delay is on the > coaster > > >> > > service > > >> > > > >> side: > > >> > > > >> > > the > > >> > > > >> > > > >> C > > >> > > > >> > > > >> > > > client is just waiting for a response. > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > The jobs are just being submitted through the > > >> local job > > >> > > > >> > > manager, so > > >> > > > >> > > > >> I > > >> > > > >> > > > >> > > > wouldn't expect any delays there. The tasks > are > > >> also > > >> > > just > > >> > > > >> > > > >> > > "/bin/hostname", > > >> > > > >> > > > >> > > > so should return immediately. > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > I'm going to continue digging into this on my > > >> own, but > > >> > > the > > >> > > > >> 2 > > >> > > > >> > > minute > > >> > > > >> > > > >> delay > > >> > > > >> > > > >> > > > seems like a big clue: does anyone have an > idea > > >> what > > >> > > could > > >> > > > >> cause > > >> > > > >> > > > >> stalls > > >> > > > >> > > > >> > > in > > >> > > > >> > > > >> > > > task submission of 2 minute duration? 
> > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > Cheers, > > >> > > > >> > > > >> > > > Tim > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > >> > > >> > > > >> > > > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > > > > >> > > > > >> > > > > >> > > > > >> > > >> > > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Sep 11 12:54:46 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 10:54:46 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> Message-ID: <1410458086.26144.0.camel@echo> I thought we switched to git. Mihael On Thu, 2014-09-11 at 12:41 -0500, Tim Armstrong wrote: > I thought I was running the latest trunk, I'll rebuild and see if I can > reproduce the issue. > > - Tim > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan > wrote: > > > The method "getMetaChannel()" has been removed. Where did you get the > > code from? > > > > Mihael > > > > On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > > I'm seeing failures when running Swift/T tests with > > > start-coaster-service.sh. > > > > > > E.g. the turbine test coaster-exec-1. I can provide instructions for > > > running the test if needed (roughly, you need to build Swift/T with > > coaster > > > support enabled, then make tests/coaster-exec-1.result in the turbine > > > directory). The github swift-t release is up to date if you want to use > > > that. > > > > > > Full log is attached, stack trace excerpt is below. > > > > > > - Tim > > > > > > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
> > > id=0911-1112130 > > > Using threaded sender for TCPChannel [type: server, contact: > > 127.0.0.1:48242 > > > ] > > > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel Using > > > threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242] > > > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > > > @id://null-nullS > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > > at > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > at > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > at > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > at > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > provider=local > > > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler provider=local > > > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > > > @id://null-nullS > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > at > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > at > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > at > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > at > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job > > > description > > > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > > description > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > at > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > at > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > at > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > at > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > > channel: null at id://null-nullS > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > at > > > > > 
org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > ... 4 more > > > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, > > > SUBMITJOB) sending error: Could not deserialize job description > > > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > > description > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > at > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > at > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > at > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > at > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > > channel: null at id://null-nullS > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > at > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > ... 4 more > > > > > > > > > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > > tim.g.armstrong at gmail.com> > > > wrote: > > > > > > > This all sounds great. > > > > > > > > Just to check that I've understood correctly, from the client's point > > of > > > > view: > > > > * The per-client settings behave the same if -shared is not provided. > > > > * Per-client settings are ignored if -shared is provided > > > > > > > > I had one question: > > > > * Do automatically allocated workers work with per-client settings? I > > > > understand there were some issues related to sharing workers between > > > > clients. Was the solution to have separate worker pools, or is this > > just > > > > not supported? > > > > > > > > - Tim > > > > > > > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan > > > > wrote: > > > > > > > >> So... > > > >> > > > >> There were bugs. Lots of bugs. > > > >> I did some work over the weekend to fix some of these and clean up the > > > >> coaster code. Here's a summary: > > > >> > > > >> - there was some stuff in the low level coaster code to deal with > > > >> persisting coaster channels over multiple connections with various > > > >> options, like periodic connections, client or server initiated > > > >> connections, buffering of commands, etc. None of this was used by > > Swift, > > > >> and the code was pretty messy. I removed that. > > > >> - there were some issues with multiple clients: > > > >> * improper shutdown of relevant workers when a client disconnected > > > >> * the worker task dispatcher was a singleton and had a reference to > > > >> one block allocator, whereas multiple clients involved multiple > > > >> allocators. > > > >> - there were a bunch of locking issues in the C client that valgrind > > > >> caught > > > >> - the idea of remote job ids was a bit hard to work with. 
This remote > > id > > > >> was the job id that the service assigned to a job. This is necessary > > > >> because two different clients can submit jobs with the same id. The > > > >> remote id would be communicated to the client as the reply to the > > submit > > > >> request. However, it was entirely possible for a notification about > > job > > > >> status to be sent to the client before the submit reply was. Since > > > >> notifications were sent using the remote-id, the client would have no > > > >> idea what job the notifications belonged to. Now, the server might > > need > > > >> a unique job id, but there is no reason why it cannot use the client > > id > > > >> when communicating the status to a client. So that's there now. > > > >> - the way the C client was working, its jobs ended up not going to the > > > >> workers, but the local queue. The service settings now allow > > specifying > > > >> the provider/jobManager/url to be used to start blocks, and jobs are > > > >> routed appropriately if they do not have the batch job flag set. > > > >> > > > >> I also added a shared service mode. We discussed this before. > > Basically > > > >> you start the coaster service with "-shared " and > > > >> all the settings are read from that file. In this case, all clients > > > >> share the same worker pool, and client settings are ignored. > > > >> > > > >> The C client now has a multi-job testing tool which can submit many > > jobs > > > >> with the desired level of concurrency. > > > >> > > > >> I have tested the C client with both shared and non-shared mode, with > > > >> various levels of jobs being sent, with either one or two concurrent > > > >> clients. > > > >> > > > >> I haven't tested manual workers. > > > >> > > > >> I've also decided that during normal operation (i.e. client connects, > > > >> submits jobs, shuts down gracefully), there should be no exceptions in > > > >> the coaster log. I think we should stick to that principle. This was > > the > > > >> case last I tested, and we should consider any deviation from that to > > be > > > >> a problem. Of course, there are some things for which there is no > > > >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions are > > > >> fine in that case. > > > >> > > > >> So anyway, let's start from here. > > > >> > > > >> Mihael > > > >> > > > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > > >> > Thanks, let me know if there's anything I can help do. > > > >> > > > > >> > - Tim > > > >> > > > > >> > > > > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > > hategan at mcs.anl.gov> > > > >> wrote: > > > >> > > > > >> > > Thanks. It also seems that there is an older bug in there in > > which the > > > >> > > client connection is not properly accounted for and things start > > > >> failing > > > >> > > two minutes after the client connects (which is also probably why > > you > > > >> > > didn't see this in runs with many short client connections). I'm > > not > > > >> > > sure why the fix for that bug isn't in the trunk code. > > > >> > > > > > >> > > In any event, I'll set up a client submission loop and fix all > > these > > > >> > > things. > > > >> > > > > > >> > > Mihael > > > >> > > > > > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > > >> > > > Ok, here it is with the additional debug messages. Source code > > > >> change is > > > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. 
> > > >> > > > > > > >> > > > Warning: the tarball will expand to several gigabytes of logs. > > > >> > > > > > > >> > > > I had to do multiple client runs to trigger it. It seems like > > the > > > >> > > problem > > > >> > > > might be triggered by abnormal termination of the client. > > First 18 > > > >> runs > > > >> > > > went fine, problem only started when I ctrl-c-ed the swift/t > > run #19 > > > >> > > before > > > >> > > > the run #20 that exhibited delays. > > > >> > > > > > > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > >> > > > > > > >> > > > - Tim > > > >> > > > > > > >> > > > > > > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > > >> tim.g.armstrong at gmail.com > > > >> > > > > > > >> > > > wrote: > > > >> > > > > > > >> > > > > It's here: > > > >> > > > > > > > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > > . > > > >> > > > > > > > >> > > > > I'll add some extra debug messages in the coaster C++ client > > and > > > >> see > > > >> > > if I > > > >> > > > > can recreate the scenario. > > > >> > > > > > > > >> > > > > - Tim > > > >> > > > > > > > >> > > > > > > > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > > >> hategan at mcs.anl.gov> > > > >> > > > > wrote: > > > >> > > > > > > > >> > > > >> Ok, so that's legit. > > > >> > > > >> > > > >> > > > >> It does look like shut down workers are not properly > > accounted > > > >> for in > > > >> > > > >> some places (and I believe Yadu submitted a bug for this). > > > >> However, I > > > >> > > do > > > >> > > > >> not see the dead time you mention in either of the last two > > sets > > > >> of > > > >> > > > >> logs. It looks like each client instance submits a continous > > > >> stream of > > > >> > > > >> jobs. > > > >> > > > >> > > > >> > > > >> So let's get back to the initial log. Can I have the full > > > >> service log? > > > >> > > > >> I'm trying to track what happened with the jobs submitted > > before > > > >> the > > > >> > > > >> first big pause. > > > >> > > > >> > > > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or > > > >> friends) > > > >> > > > >> would probably help a lot here. > > > >> > > > >> > > > >> > > > >> Mihael > > > >> > > > >> > > > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > > >> > > > >> > Should be here: > > > >> > > > >> > > > > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > > > > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > > > >> hategan at mcs.anl.gov > > > >> > > > > > > >> > > > >> wrote: > > > >> > > > >> > > > > >> > > > >> > > The first worker "failing" is 0904-20022331. The log > > looks > > > >> funny > > > >> > > at > > > >> > > > >> the > > > >> > > > >> > > end. > > > >> > > > >> > > > > > >> > > > >> > > Can you git pull and re-run? The worker is getting some > > > >> command > > > >> > > at the > > > >> > > > >> > > end there and doing nothing about it and I wonder why. 
> > > >> > > > >> > > > > > >> > > > >> > > Mihael > > > >> > > > >> > > > > > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > > > >> > > > >> > > > Ok, now I have some worker logs: > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > >> > > > >> > > > > > > >> > > > >> > > > There's nothing obvious I see in the worker logs that > > would > > > >> > > > >> indicate why > > > >> > > > >> > > > the connection was broken. > > > >> > > > >> > > > > > > >> > > > >> > > > - Tim > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > > >> > > > >> tim.g.armstrong at gmail.com > > > >> > > > >> > > > > > > >> > > > >> > > > wrote: > > > >> > > > >> > > > > > > >> > > > >> > > > > This is all running locally on my laptop, so I think > > we > > > >> can > > > >> > > rule > > > >> > > > >> out > > > >> > > > >> > > 1). > > > >> > > > >> > > > > > > > >> > > > >> > > > > It also seems like it's a state the coaster service > > gets > > > >> into > > > >> > > > >> after a > > > >> > > > >> > > few > > > >> > > > >> > > > > client sessions: generally the first coaster run > > works > > > >> fine, > > > >> > > then > > > >> > > > >> > > after a > > > >> > > > >> > > > > few runs the problem occurs more frequently. > > > >> > > > >> > > > > > > > >> > > > >> > > > > I'm going to try and get worker logs, in the meantime > > > >> i've got > > > >> > > > >> some > > > >> > > > >> > > > > jstacks (attached). > > > >> > > > >> > > > > > > > >> > > > >> > > > > Matching service logs (largish) are here if needed: > > > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > >> > > > >> > > > > > > > >> > > > >> > > > > > > > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > > > >> > > > >> hategan at mcs.anl.gov> > > > >> > > > >> > > > > wrote: > > > >> > > > >> > > > > > > > >> > > > >> > > > >> Ah, makes sense. > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live > > connection > > > >> is > > > >> > > > >> guaranteed > > > >> > > > >> > > to > > > >> > > > >> > > > >> have some communication for any 2 minute time > > window, > > > >> > > partially > > > >> > > > >> due to > > > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no > > > >> packets flow > > > >> > > > >> for the > > > >> > > > >> > > > >> duration of 2 minutes, the connection is assumed > > broken > > > >> and > > > >> > > all > > > >> > > > >> jobs > > > >> > > > >> > > > >> that were submitted to the respective workers are > > > >> considered > > > >> > > > >> failed. > > > >> > > > >> > > So > > > >> > > > >> > > > >> there seems to be an issue with the connections to > > some > > > >> of > > > >> > > the > > > >> > > > >> > > workers, > > > >> > > > >> > > > >> and it takes 2 minutes to detect them. 
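To make the timeout arithmetic described above concrete: a connection is only declared dead after a full two-minute quiet window, even though heartbeats go out every minute. Below is a minimal sketch of that kind of bookkeeping in C++ with invented names; the real logic lives in the Java service and worker.pl, so this only illustrates the rule being described, not the actual coaster code:

    #include <chrono>
    #include <map>
    #include <string>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    // Hypothetical per-connection record; the actual service keeps the
    // equivalent state inside its channel objects.
    struct ConnectionState {
        Clock::time_point lastTraffic;  // refreshed on every packet, heartbeats included
    };

    constexpr auto HEARTBEAT_INTERVAL = std::chrono::seconds(60);   // "sent every 1 minute"
    constexpr auto CHANNEL_TIMEOUT    = std::chrono::seconds(120);  // "2 minute time window"

    // Run periodically: any connection with no traffic for the full timeout
    // window is reported as broken, and jobs sent to its workers are failed.
    std::vector<std::string> findDeadConnections(
            const std::map<std::string, ConnectionState>& connections) {
        std::vector<std::string> dead;
        const auto now = Clock::now();
        for (const auto& [id, state] : connections) {
            if (now - state.lastTraffic > CHANNEL_TIMEOUT) {
                dead.push_back(id);
            }
        }
        return dead;
    }

A worker that dies without closing its socket therefore only surfaces after the full two-minute window has elapsed, which is consistent with the roughly two-minute gaps reported at the start of the thread.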
> > > >> > > > >> > > > >> > > > >> > > > >> > > > >> Since the service seems to be alive (although a > > jstack > > > >> on the > > > >> > > > >> service > > > >> > > > >> > > > >> when thing seem to hang might help), this leaves two > > > >> > > > >> possibilities: > > > >> > > > >> > > > >> 1 - some genuine network problem > > > >> > > > >> > > > >> 2 - the worker died without properly closing TCP > > > >> connections > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> If (2), you could enable worker logging > > > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to > > see > > > >> if > > > >> > > > >> anything > > > >> > > > >> > > shows > > > >> > > > >> > > > >> up. > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> Mihael > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong > > wrote: > > > >> > > > >> > > > >> > Here are client and service logs, with part of > > > >> service log > > > >> > > > >> edited > > > >> > > > >> > > down > > > >> > > > >> > > > >> to > > > >> > > > >> > > > >> > be a reasonable size (I have the full thing if > > > >> needed, but > > > >> > > it > > > >> > > > >> was > > > >> > > > >> > > over a > > > >> > > > >> > > > >> > gigabyte). > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. > > The > > > >> client > > > >> > > > >> submits 4 > > > >> > > > >> > > > >> jobs > > > >> > > > >> > > > >> > (its limit), but they don't complete until > > 19:51:32 > > > >> or so > > > >> > > (I > > > >> > > > >> can see > > > >> > > > >> > > > >> that > > > >> > > > >> > > > >> > one task completed based on ncompleted=1 in the > > > >> > > check_tasks log > > > >> > > > >> > > > >> message). > > > >> > > > >> > > > >> > It looks like something has happened with broken > > > >> pipes and > > > >> > > > >> workers > > > >> > > > >> > > being > > > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of > > > >> that is > > > >> > > > >> likely to > > > >> > > > >> > > be. > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > - Tim > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > > > >> > > > >> hategan at mcs.anl.gov > > > >> > > > >> > > > > > > >> > > > >> > > > >> wrote: > > > >> > > > >> > > > >> > > > > >> > > > >> > > > >> > > Hi Tim, > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > I've never seen this before with pure Java. > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > Do you have logs from these runs? > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > Mihael > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim Armstrong > > > >> wrote: > > > >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit > > > >> tasks to > > > >> > > > >> Coasters > > > >> > > > >> > > > >> through > > > >> > > > >> > > > >> > > the > > > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour > > > >> where task > > > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 minute > > > >> periods. 
> > > >> > > > >> For > > > >> > > > >> > > > >> example, I'm > > > >> > > > >> > > > >> > > > seeing submit log messages like "submitting > > > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > > > >> /bin/hostname" in > > > >> > > > >> bursts of > > > >> > > > >> > > > >> several > > > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in > > > >> between, > > > >> > > e.g. > > > >> > > > >> I'm > > > >> > > > >> > > seeing > > > >> > > > >> > > > >> > > bursts > > > >> > > > >> > > > >> > > > with the following intervals in my logs. > > > >> > > > >> > > > >> > > > > > > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > >> > > > >> > > > >> > > > > > > >> > > > >> > > > >> > > > From what I can tell, the delay is on the > > coaster > > > >> > > service > > > >> > > > >> side: > > > >> > > > >> > > the > > > >> > > > >> > > > >> C > > > >> > > > >> > > > >> > > > client is just waiting for a response. > > > >> > > > >> > > > >> > > > > > > >> > > > >> > > > >> > > > The jobs are just being submitted through the > > > >> local job > > > >> > > > >> > > manager, so > > > >> > > > >> > > > >> I > > > >> > > > >> > > > >> > > > wouldn't expect any delays there. The tasks > > are > > > >> also > > > >> > > just > > > >> > > > >> > > > >> > > "/bin/hostname", > > > >> > > > >> > > > >> > > > so should return immediately. > > > >> > > > >> > > > >> > > > > > > >> > > > >> > > > >> > > > I'm going to continue digging into this on my > > > >> own, but > > > >> > > the > > > >> > > > >> 2 > > > >> > > > >> > > minute > > > >> > > > >> > > > >> delay > > > >> > > > >> > > > >> > > > seems like a big clue: does anyone have an > > idea > > > >> what > > > >> > > could > > > >> > > > >> cause > > > >> > > > >> > > > >> stalls > > > >> > > > >> > > > >> > > in > > > >> > > > >> > > > >> > > > task submission of 2 minute duration? > > > >> > > > >> > > > >> > > > > > > >> > > > >> > > > >> > > > Cheers, > > > >> > > > >> > > > >> > > > Tim > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > >> > > > >> > > > > > > > > > > From tim.g.armstrong at gmail.com Thu Sep 11 13:10:52 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 13:10:52 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> Message-ID: I meant the github master., but it turns out that I had had the wrong Swift on my path. Apologies for the confusion. I've rerun with the current one. I'm getting a null pointer exception on line 226 of BlockQueueProcessor.java. Adding some printfs revealed that settings was null. Log attached. 
- Tim Job: Job(id:0 600.000s) Settings: null java.lang.NullPointerException at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) at org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) at org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) at org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) at org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) at org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) at org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong wrote: > I thought I was running the latest trunk, I'll rebuild and see if I can > reproduce the issue. > > - Tim > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan > wrote: > >> The method "getMetaChannel()" has been removed. Where did you get the >> code from? >> >> Mihael >> >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: >> > I'm seeing failures when running Swift/T tests with >> > start-coaster-service.sh. >> > >> > E.g. the turbine test coaster-exec-1. I can provide instructions for >> > running the test if needed (roughly, you need to build Swift/T with >> coaster >> > support enabled, then make tests/coaster-exec-1.result in the turbine >> > directory). The github swift-t release is up to date if you want to use >> > that. >> > >> > Full log is attached, stack trace excerpt is below. >> > >> > - Tim >> > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
>> > id=0911-1112130 >> > Using threaded sender for TCPChannel [type: server, contact: >> 127.0.0.1:48242 >> > ] >> > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel Using >> > threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242] >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null >> > @id://null-nullS >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) >> > at >> > >> org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) >> > at >> > >> org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) >> > at >> > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) >> > at >> > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) >> > at >> > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) >> > at >> > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) >> > at >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) >> > provider=local >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler provider=local >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null >> > @id://null-nullS >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) >> > at >> > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) >> > at >> > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) >> > at >> > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) >> > at >> > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) >> > at >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job >> > description >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job >> > description >> > at >> > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) >> > at >> > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) >> > at >> > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) >> > at >> > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) >> > at >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid >> > channel: null at id://null-nullS >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) >> > at >> > >> 
org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) >> > at >> > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) >> > ... 4 more >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, >> > SUBMITJOB) sending error: Could not deserialize job description >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job >> > description >> > at >> > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) >> > at >> > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) >> > at >> > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) >> > at >> > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) >> > at >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid >> > channel: null at id://null-nullS >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) >> > at >> > >> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) >> > at >> > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) >> > ... 4 more >> > >> > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < >> tim.g.armstrong at gmail.com> >> > wrote: >> > >> > > This all sounds great. >> > > >> > > Just to check that I've understood correctly, from the client's point >> of >> > > view: >> > > * The per-client settings behave the same if -shared is not provided. >> > > * Per-client settings are ignored if -shared is provided >> > > >> > > I had one question: >> > > * Do automatically allocated workers work with per-client settings? I >> > > understand there were some issues related to sharing workers between >> > > clients. Was the solution to have separate worker pools, or is this >> just >> > > not supported? >> > > >> > > - Tim >> > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan >> > > wrote: >> > > >> > >> So... >> > >> >> > >> There were bugs. Lots of bugs. >> > >> I did some work over the weekend to fix some of these and clean up >> the >> > >> coaster code. Here's a summary: >> > >> >> > >> - there was some stuff in the low level coaster code to deal with >> > >> persisting coaster channels over multiple connections with various >> > >> options, like periodic connections, client or server initiated >> > >> connections, buffering of commands, etc. None of this was used by >> Swift, >> > >> and the code was pretty messy. I removed that. >> > >> - there were some issues with multiple clients: >> > >> * improper shutdown of relevant workers when a client disconnected >> > >> * the worker task dispatcher was a singleton and had a reference to >> > >> one block allocator, whereas multiple clients involved multiple >> > >> allocators. >> > >> - there were a bunch of locking issues in the C client that valgrind >> > >> caught >> > >> - the idea of remote job ids was a bit hard to work with. This >> remote id >> > >> was the job id that the service assigned to a job. This is necessary >> > >> because two different clients can submit jobs with the same id. 
The >> > >> remote id would be communicated to the client as the reply to the >> submit >> > >> request. However, it was entirely possible for a notification about >> job >> > >> status to be sent to the client before the submit reply was. Since >> > >> notifications were sent using the remote-id, the client would have no >> > >> idea what job the notifications belonged to. Now, the server might >> need >> > >> a unique job id, but there is no reason why it cannot use the client >> id >> > >> when communicating the status to a client. So that's there now. >> > >> - the way the C client was working, its jobs ended up not going to >> the >> > >> workers, but the local queue. The service settings now allow >> specifying >> > >> the provider/jobManager/url to be used to start blocks, and jobs are >> > >> routed appropriately if they do not have the batch job flag set. >> > >> >> > >> I also added a shared service mode. We discussed this before. >> Basically >> > >> you start the coaster service with "-shared " and >> > >> all the settings are read from that file. In this case, all clients >> > >> share the same worker pool, and client settings are ignored. >> > >> >> > >> The C client now has a multi-job testing tool which can submit many >> jobs >> > >> with the desired level of concurrency. >> > >> >> > >> I have tested the C client with both shared and non-shared mode, with >> > >> various levels of jobs being sent, with either one or two concurrent >> > >> clients. >> > >> >> > >> I haven't tested manual workers. >> > >> >> > >> I've also decided that during normal operation (i.e. client connects, >> > >> submits jobs, shuts down gracefully), there should be no exceptions >> in >> > >> the coaster log. I think we should stick to that principle. This was >> the >> > >> case last I tested, and we should consider any deviation from that >> to be >> > >> a problem. Of course, there are some things for which there is no >> > >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions >> are >> > >> fine in that case. >> > >> >> > >> So anyway, let's start from here. >> > >> >> > >> Mihael >> > >> >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: >> > >> > Thanks, let me know if there's anything I can help do. >> > >> > >> > >> > - Tim >> > >> > >> > >> > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < >> hategan at mcs.anl.gov> >> > >> wrote: >> > >> > >> > >> > > Thanks. It also seems that there is an older bug in there in >> which the >> > >> > > client connection is not properly accounted for and things start >> > >> failing >> > >> > > two minutes after the client connects (which is also probably >> why you >> > >> > > didn't see this in runs with many short client connections). I'm >> not >> > >> > > sure why the fix for that bug isn't in the trunk code. >> > >> > > >> > >> > > In any event, I'll set up a client submission loop and fix all >> these >> > >> > > things. >> > >> > > >> > >> > > Mihael >> > >> > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: >> > >> > > > Ok, here it is with the additional debug messages. Source code >> > >> change is >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. >> > >> > > > >> > >> > > > Warning: the tarball will expand to several gigabytes of logs. >> > >> > > > >> > >> > > > I had to do multiple client runs to trigger it. It seems like >> the >> > >> > > problem >> > >> > > > might be triggered by abnormal termination of the client. 
>> First 18 >> > >> runs >> > >> > > > went fine, problem only started when I ctrl-c-ed the swift/t >> run #19 >> > >> > > before >> > >> > > > the run #20 that exhibited delays. >> > >> > > > >> > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz >> > >> > > > >> > >> > > > - Tim >> > >> > > > >> > >> > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < >> > >> tim.g.armstrong at gmail.com >> > >> > > > >> > >> > > > wrote: >> > >> > > > >> > >> > > > > It's here: >> > >> > > > > >> > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz >> . >> > >> > > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ client >> and >> > >> see >> > >> > > if I >> > >> > > > > can recreate the scenario. >> > >> > > > > >> > >> > > > > - Tim >> > >> > > > > >> > >> > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < >> > >> hategan at mcs.anl.gov> >> > >> > > > > wrote: >> > >> > > > > >> > >> > > > >> Ok, so that's legit. >> > >> > > > >> >> > >> > > > >> It does look like shut down workers are not properly >> accounted >> > >> for in >> > >> > > > >> some places (and I believe Yadu submitted a bug for this). >> > >> However, I >> > >> > > do >> > >> > > > >> not see the dead time you mention in either of the last two >> sets >> > >> of >> > >> > > > >> logs. It looks like each client instance submits a continous >> > >> stream of >> > >> > > > >> jobs. >> > >> > > > >> >> > >> > > > >> So let's get back to the initial log. Can I have the full >> > >> service log? >> > >> > > > >> I'm trying to track what happened with the jobs submitted >> before >> > >> the >> > >> > > > >> first big pause. >> > >> > > > >> >> > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or >> > >> friends) >> > >> > > > >> would probably help a lot here. >> > >> > > > >> >> > >> > > > >> Mihael >> > >> > > > >> >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: >> > >> > > > >> > Should be here: >> > >> > > > >> > >> > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz >> > >> > > > >> > >> > >> > > > >> > >> > >> > > > >> > >> > >> > > > >> > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < >> > >> hategan at mcs.anl.gov >> > >> > > > >> > >> > > > >> wrote: >> > >> > > > >> > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. The log >> looks >> > >> funny >> > >> > > at >> > >> > > > >> the >> > >> > > > >> > > end. >> > >> > > > >> > > >> > >> > > > >> > > Can you git pull and re-run? The worker is getting some >> > >> command >> > >> > > at the >> > >> > > > >> > > end there and doing nothing about it and I wonder why. >> > >> > > > >> > > >> > >> > > > >> > > Mihael >> > >> > > > >> > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: >> > >> > > > >> > > > Ok, now I have some worker logs: >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz >> > >> > > > >> > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker logs that >> would >> > >> > > > >> indicate why >> > >> > > > >> > > > the connection was broken. 
>> > >> > > > >> > > > >> > >> > > > >> > > > - Tim >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < >> > >> > > > >> tim.g.armstrong at gmail.com >> > >> > > > >> > > > >> > >> > > > >> > > > wrote: >> > >> > > > >> > > > >> > >> > > > >> > > > > This is all running locally on my laptop, so I >> think we >> > >> can >> > >> > > rule >> > >> > > > >> out >> > >> > > > >> > > 1). >> > >> > > > >> > > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster service >> gets >> > >> into >> > >> > > > >> after a >> > >> > > > >> > > few >> > >> > > > >> > > > > client sessions: generally the first coaster run >> works >> > >> fine, >> > >> > > then >> > >> > > > >> > > after a >> > >> > > > >> > > > > few runs the problem occurs more frequently. >> > >> > > > >> > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, in the >> meantime >> > >> i've got >> > >> > > > >> some >> > >> > > > >> > > > > jstacks (attached). >> > >> > > > >> > > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if needed: >> > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < >> > >> > > > >> hategan at mcs.anl.gov> >> > >> > > > >> > > > > wrote: >> > >> > > > >> > > > > >> > >> > > > >> > > > >> Ah, makes sense. >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live >> connection >> > >> is >> > >> > > > >> guaranteed >> > >> > > > >> > > to >> > >> > > > >> > > > >> have some communication for any 2 minute time >> window, >> > >> > > partially >> > >> > > > >> due to >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no >> > >> packets flow >> > >> > > > >> for the >> > >> > > > >> > > > >> duration of 2 minutes, the connection is assumed >> broken >> > >> and >> > >> > > all >> > >> > > > >> jobs >> > >> > > > >> > > > >> that were submitted to the respective workers are >> > >> considered >> > >> > > > >> failed. >> > >> > > > >> > > So >> > >> > > > >> > > > >> there seems to be an issue with the connections to >> some >> > >> of >> > >> > > the >> > >> > > > >> > > workers, >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> Since the service seems to be alive (although a >> jstack >> > >> on the >> > >> > > > >> service >> > >> > > > >> > > > >> when thing seem to hang might help), this leaves >> two >> > >> > > > >> possibilities: >> > >> > > > >> > > > >> 1 - some genuine network problem >> > >> > > > >> > > > >> 2 - the worker died without properly closing TCP >> > >> connections >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> If (2), you could enable worker logging >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to >> see >> > >> if >> > >> > > > >> anything >> > >> > > > >> > > shows >> > >> > > > >> > > > >> up. 
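For reference, turning that worker logging on from the C++ client side would look something like the sketch below. The key name is the one quoted above; the set() accessor and the header name are assumptions about the client API rather than anything confirmed in this thread, so check the actual Settings header before relying on it:

    #include "Settings.h"  // coaster C++ client settings header; exact path may differ

    // Sketch only: ask the service to start workers with DEBUG-level logging,
    // so each worker writes a log file that can be inspected after a failure.
    // Assumes Settings exposes a set(key, value) accessor.
    void enableWorkerLogging(Settings& settings) {
        settings.set(Settings::Key::WORKER_LOGGING_LEVEL, "DEBUG");
    }

Whatever the exact call, this is the mechanism behind the worker-logs tarballs exchanged elsewhere in the thread.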
>> > >> > > > >> > > > >> >> > >> > > > >> > > > >> Mihael >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong >> wrote: >> > >> > > > >> > > > >> > Here are client and service logs, with part of >> > >> service log >> > >> > > > >> edited >> > >> > > > >> > > down >> > >> > > > >> > > > >> to >> > >> > > > >> > > > >> > be a reasonable size (I have the full thing if >> > >> needed, but >> > >> > > it >> > >> > > > >> was >> > >> > > > >> > > over a >> > >> > > > >> > > > >> > gigabyte). >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. >> The >> > >> client >> > >> > > > >> submits 4 >> > >> > > > >> > > > >> jobs >> > >> > > > >> > > > >> > (its limit), but they don't complete until >> 19:51:32 >> > >> or so >> > >> > > (I >> > >> > > > >> can see >> > >> > > > >> > > > >> that >> > >> > > > >> > > > >> > one task completed based on ncompleted=1 in the >> > >> > > check_tasks log >> > >> > > > >> > > > >> message). >> > >> > > > >> > > > >> > It looks like something has happened with broken >> > >> pipes and >> > >> > > > >> workers >> > >> > > > >> > > being >> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of >> > >> that is >> > >> > > > >> likely to >> > >> > > > >> > > be. >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > - Tim >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < >> > >> > > > >> hategan at mcs.anl.gov >> > >> > > > >> > > > >> > >> > > > >> > > > >> wrote: >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > > Hi Tim, >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > I've never seen this before with pure Java. >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > Mihael >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim >> Armstrong >> > >> wrote: >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit >> > >> tasks to >> > >> > > > >> Coasters >> > >> > > > >> > > > >> through >> > >> > > > >> > > > >> > > the >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour >> > >> where task >> > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 >> minute >> > >> periods. >> > >> > > > >> For >> > >> > > > >> > > > >> example, I'm >> > >> > > > >> > > > >> > > > seeing submit log messages like "submitting >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: >> > >> /bin/hostname" in >> > >> > > > >> bursts of >> > >> > > > >> > > > >> several >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in >> > >> between, >> > >> > > e.g. >> > >> > > > >> I'm >> > >> > > > >> > > seeing >> > >> > > > >> > > > >> > > bursts >> > >> > > > >> > > > >> > > > with the following intervals in my logs. 
>> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is on the >> coaster >> > >> > > service >> > >> > > > >> side: >> > >> > > > >> > > the >> > >> > > > >> > > > >> C >> > >> > > > >> > > > >> > > > client is just waiting for a response. >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted through the >> > >> local job >> > >> > > > >> > > manager, so >> > >> > > > >> > > > >> I >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. The tasks >> are >> > >> also >> > >> > > just >> > >> > > > >> > > > >> > > "/bin/hostname", >> > >> > > > >> > > > >> > > > so should return immediately. >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into this on my >> > >> own, but >> > >> > > the >> > >> > > > >> 2 >> > >> > > > >> > > minute >> > >> > > > >> > > > >> delay >> > >> > > > >> > > > >> > > > seems like a big clue: does anyone have an >> idea >> > >> what >> > >> > > could >> > >> > > > >> cause >> > >> > > > >> > > > >> stalls >> > >> > > > >> > > > >> > > in >> > >> > > > >> > > > >> > > > task submission of 2 minute duration? >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > Cheers, >> > >> > > > >> > > > >> > > > Tim >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> > >> > > > >> > > > >> >> > >> > > > >> > > > > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > >> >> > >> > > > >> >> > >> > > > >> >> > >> > > > > >> > >> > > >> > >> > > >> > >> > > >> > >> >> > >> >> > >> >> > > >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: start-coaster-service.log.gz Type: application/x-gzip Size: 1370 bytes Desc: not available URL: From hategan at mcs.anl.gov Thu Sep 11 13:23:31 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 11:23:31 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> Message-ID: <1410459811.27191.5.camel@echo> The coaster logging was broken, and that brokenness caused it to print everything on stdout. That got fixed, so the actual log is now in ./cps*.log. So I probably need that log. Mihael On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote: > I meant the github master., but it turns out that I had had the wrong Swift > on my path. Apologies for the confusion. > > I've rerun with the current one. > > I'm getting a null pointer exception on line 226 of > BlockQueueProcessor.java. Adding some printfs revealed that settings was > null. > > Log attached. 
> > - Tim > > Job: Job(id:0 600.000s) > Settings: null > java.lang.NullPointerException > at > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) > at > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) > at > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) > at > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) > at > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) > at > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) > at > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) > at > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) > at org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong > wrote: > > > I thought I was running the latest trunk, I'll rebuild and see if I can > > reproduce the issue. > > > > - Tim > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan > > wrote: > > > >> The method "getMetaChannel()" has been removed. Where did you get the > >> code from? > >> > >> Mihael > >> > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > >> > I'm seeing failures when running Swift/T tests with > >> > start-coaster-service.sh. > >> > > >> > E.g. the turbine test coaster-exec-1. I can provide instructions for > >> > running the test if needed (roughly, you need to build Swift/T with > >> coaster > >> > support enabled, then make tests/coaster-exec-1.result in the turbine > >> > directory). The github swift-t release is up to date if you want to use > >> > that. > >> > > >> > Full log is attached, stack trace excerpt is below. > >> > > >> > - Tim > >> > > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
> >> > id=0911-1112130 > >> > Using threaded sender for TCPChannel [type: server, contact: > >> 127.0.0.1:48242 > >> > ] > >> > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel Using > >> > threaded sender for TCPChannel [type: server, contact: 127.0.0.1:48242] > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > >> > @id://null-nullS > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > >> > at > >> > > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > >> > at > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > >> > provider=local > >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler provider=local > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: null > >> > @id://null-nullS > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > >> > at > >> > > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > >> > at > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize job > >> > description > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job > >> > description > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > >> > at > >> > > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > >> > at > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > >> > channel: null at id://null-nullS > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > >> > at > >> > > 
>> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > >> > ... 4 more > >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, > >> > SUBMITJOB) sending error: Could not deserialize job description > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job > >> > description > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > >> > at > >> > > >> org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > >> > at > >> > > >> org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > >> > at > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > >> > channel: null at id://null-nullS > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > >> > at > >> > > >> org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > >> > at > >> > > >> org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > >> > ... 4 more > >> > > >> > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > >> tim.g.armstrong at gmail.com> > >> > wrote: > >> > > >> > > This all sounds great. > >> > > > >> > > Just to check that I've understood correctly, from the client's point > >> of > >> > > view: > >> > > * The per-client settings behave the same if -shared is not provided. > >> > > * Per-client settings are ignored if -shared is provided > >> > > > >> > > I had one question: > >> > > * Do automatically allocated workers work with per-client settings? I > >> > > understand there were some issues related to sharing workers between > >> > > clients. Was the solution to have separate worker pools, or is this > >> just > >> > > not supported? > >> > > > >> > > - Tim > >> > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan > >> > > wrote: > >> > > > >> > >> So... > >> > >> > >> > >> There were bugs. Lots of bugs. > >> > >> I did some work over the weekend to fix some of these and clean up > >> the > >> > >> coaster code. Here's a summary: > >> > >> > >> > >> - there was some stuff in the low level coaster code to deal with > >> > >> persisting coaster channels over multiple connections with various > >> > >> options, like periodic connections, client or server initiated > >> > >> connections, buffering of commands, etc. None of this was used by > >> Swift, > >> > >> and the code was pretty messy. I removed that. > >> > >> - there were some issues with multiple clients: > >> > >> * improper shutdown of relevant workers when a client disconnected > >> > >> * the worker task dispatcher was a singleton and had a reference to > >> > >> one block allocator, whereas multiple clients involved multiple > >> > >> allocators. 
> >> > >> - there were a bunch of locking issues in the C client that valgrind > >> > >> caught > >> > >> - the idea of remote job ids was a bit hard to work with. This > >> remote id > >> > >> was the job id that the service assigned to a job. This is necessary > >> > >> because two different clients can submit jobs with the same id. The > >> > >> remote id would be communicated to the client as the reply to the > >> submit > >> > >> request. However, it was entirely possible for a notification about > >> job > >> > >> status to be sent to the client before the submit reply was. Since > >> > >> notifications were sent using the remote-id, the client would have no > >> > >> idea what job the notifications belonged to. Now, the server might > >> need > >> > >> a unique job id, but there is no reason why it cannot use the client > >> id > >> > >> when communicating the status to a client. So that's there now. > >> > >> - the way the C client was working, its jobs ended up not going to > >> the > >> > >> workers, but the local queue. The service settings now allow > >> specifying > >> > >> the provider/jobManager/url to be used to start blocks, and jobs are > >> > >> routed appropriately if they do not have the batch job flag set. > >> > >> > >> > >> I also added a shared service mode. We discussed this before. > >> Basically > >> > >> you start the coaster service with "-shared " and > >> > >> all the settings are read from that file. In this case, all clients > >> > >> share the same worker pool, and client settings are ignored. > >> > >> > >> > >> The C client now has a multi-job testing tool which can submit many > >> jobs > >> > >> with the desired level of concurrency. > >> > >> > >> > >> I have tested the C client with both shared and non-shared mode, with > >> > >> various levels of jobs being sent, with either one or two concurrent > >> > >> clients. > >> > >> > >> > >> I haven't tested manual workers. > >> > >> > >> > >> I've also decided that during normal operation (i.e. client connects, > >> > >> submits jobs, shuts down gracefully), there should be no exceptions > >> in > >> > >> the coaster log. I think we should stick to that principle. This was > >> the > >> > >> case last I tested, and we should consider any deviation from that > >> to be > >> > >> a problem. Of course, there are some things for which there is no > >> > >> graceful shut down, such as ctrl+C-ing a manual worker. Exceptions > >> are > >> > >> fine in that case. > >> > >> > >> > >> So anyway, let's start from here. > >> > >> > >> > >> Mihael > >> > >> > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > >> > >> > Thanks, let me know if there's anything I can help do. > >> > >> > > >> > >> > - Tim > >> > >> > > >> > >> > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > >> hategan at mcs.anl.gov> > >> > >> wrote: > >> > >> > > >> > >> > > Thanks. It also seems that there is an older bug in there in > >> which the > >> > >> > > client connection is not properly accounted for and things start > >> > >> failing > >> > >> > > two minutes after the client connects (which is also probably > >> why you > >> > >> > > didn't see this in runs with many short client connections). I'm > >> not > >> > >> > > sure why the fix for that bug isn't in the trunk code. > >> > >> > > > >> > >> > > In any event, I'll set up a client submission loop and fix all > >> these > >> > >> > > things. 
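The remote-id change described in Mihael's summary above comes down to a small bookkeeping rule: the service may keep its own unique id for each job, but every status notification it sends back is keyed by the id the client supplied at submit time, so a notification that races ahead of the submit reply is still unambiguous. A minimal sketch of that rule, using hypothetical names rather than the actual coaster classes:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the idea only; not the coaster implementation.
public class JobIdRegistry {
    private final AtomicLong nextServerId = new AtomicLong();
    // server-side id -> the id the client used when submitting
    private final Map<Long, String> clientIds = new ConcurrentHashMap<>();

    /** Called on submit: remember the client's id, return a unique server id. */
    public long register(String clientJobId) {
        long serverId = nextServerId.incrementAndGet();
        clientIds.put(serverId, clientJobId);
        return serverId;
    }

    /** Called on a status change: notify using the client's own id, so the
     *  client can match it even if the submit reply is still in flight. */
    public void notifyStatus(long serverId, String status, Notifier notifier) {
        String clientJobId = clientIds.get(serverId);
        if (clientJobId != null) {
            notifier.send(clientJobId, status);
        }
    }

    public interface Notifier {
        void send(String clientJobId, String status);
    }
}

The essential point is only that notifyStatus() never exposes the server-side id to the client.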
> >> > >> > > > >> > >> > > Mihael > >> > >> > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > >> > >> > > > Ok, here it is with the additional debug messages. Source code > >> > >> change is > >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > >> > >> > > > > >> > >> > > > Warning: the tarball will expand to several gigabytes of logs. > >> > >> > > > > >> > >> > > > I had to do multiple client runs to trigger it. It seems like > >> the > >> > >> > > problem > >> > >> > > > might be triggered by abnormal termination of the client. > >> First 18 > >> > >> runs > >> > >> > > > went fine, problem only started when I ctrl-c-ed the swift/t > >> run #19 > >> > >> > > before > >> > >> > > > the run #20 that exhibited delays. > >> > >> > > > > >> > >> > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > >> > >> > > > > >> > >> > > > - Tim > >> > >> > > > > >> > >> > > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > >> > >> tim.g.armstrong at gmail.com > >> > >> > > > > >> > >> > > > wrote: > >> > >> > > > > >> > >> > > > > It's here: > >> > >> > > > > > >> > >> http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > >> . > >> > >> > > > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ client > >> and > >> > >> see > >> > >> > > if I > >> > >> > > > > can recreate the scenario. > >> > >> > > > > > >> > >> > > > > - Tim > >> > >> > > > > > >> > >> > > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > >> > >> hategan at mcs.anl.gov> > >> > >> > > > > wrote: > >> > >> > > > > > >> > >> > > > >> Ok, so that's legit. > >> > >> > > > >> > >> > >> > > > >> It does look like shut down workers are not properly > >> accounted > >> > >> for in > >> > >> > > > >> some places (and I believe Yadu submitted a bug for this). > >> > >> However, I > >> > >> > > do > >> > >> > > > >> not see the dead time you mention in either of the last two > >> sets > >> > >> of > >> > >> > > > >> logs. It looks like each client instance submits a continous > >> > >> stream of > >> > >> > > > >> jobs. > >> > >> > > > >> > >> > >> > > > >> So let's get back to the initial log. Can I have the full > >> > >> service log? > >> > >> > > > >> I'm trying to track what happened with the jobs submitted > >> before > >> > >> the > >> > >> > > > >> first big pause. > >> > >> > > > >> > >> > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() (or > >> > >> friends) > >> > >> > > > >> would probably help a lot here. > >> > >> > > > >> > >> > >> > > > >> Mihael > >> > >> > > > >> > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > >> > >> > > > >> > Should be here: > >> > >> > > > >> > > >> > >> > > > >> > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > >> > >> hategan at mcs.anl.gov > >> > >> > > > > >> > >> > > > >> wrote: > >> > >> > > > >> > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. The log > >> looks > >> > >> funny > >> > >> > > at > >> > >> > > > >> the > >> > >> > > > >> > > end. > >> > >> > > > >> > > > >> > >> > > > >> > > Can you git pull and re-run? The worker is getting some > >> > >> command > >> > >> > > at the > >> > >> > > > >> > > end there and doing nothing about it and I wonder why. 
> >> > >> > > > >> > > > >> > >> > > > >> > > Mihael > >> > >> > > > >> > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong wrote: > >> > >> > > > >> > > > Ok, now I have some worker logs: > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > >> > >> > > > >> > > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker logs that > >> would > >> > >> > > > >> indicate why > >> > >> > > > >> > > > the connection was broken. > >> > >> > > > >> > > > > >> > >> > > > >> > > > - Tim > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > >> > >> > > > >> tim.g.armstrong at gmail.com > >> > >> > > > >> > > > > >> > >> > > > >> > > > wrote: > >> > >> > > > >> > > > > >> > >> > > > >> > > > > This is all running locally on my laptop, so I > >> think we > >> > >> can > >> > >> > > rule > >> > >> > > > >> out > >> > >> > > > >> > > 1). > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster service > >> gets > >> > >> into > >> > >> > > > >> after a > >> > >> > > > >> > > few > >> > >> > > > >> > > > > client sessions: generally the first coaster run > >> works > >> > >> fine, > >> > >> > > then > >> > >> > > > >> > > after a > >> > >> > > > >> > > > > few runs the problem occurs more frequently. > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, in the > >> meantime > >> > >> i've got > >> > >> > > > >> some > >> > >> > > > >> > > > > jstacks (attached). > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if needed: > >> > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan < > >> > >> > > > >> hategan at mcs.anl.gov> > >> > >> > > > >> > > > > wrote: > >> > >> > > > >> > > > > > >> > >> > > > >> > > > >> Ah, makes sense. > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live > >> connection > >> > >> is > >> > >> > > > >> guaranteed > >> > >> > > > >> > > to > >> > >> > > > >> > > > >> have some communication for any 2 minute time > >> window, > >> > >> > > partially > >> > >> > > > >> due to > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If no > >> > >> packets flow > >> > >> > > > >> for the > >> > >> > > > >> > > > >> duration of 2 minutes, the connection is assumed > >> broken > >> > >> and > >> > >> > > all > >> > >> > > > >> jobs > >> > >> > > > >> > > > >> that were submitted to the respective workers are > >> > >> considered > >> > >> > > > >> failed. > >> > >> > > > >> > > So > >> > >> > > > >> > > > >> there seems to be an issue with the connections to > >> some > >> > >> of > >> > >> > > the > >> > >> > > > >> > > workers, > >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. 
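The numbers in the explanation above -- heartbeats every minute, a connection declared broken after a two-minute silent window, and all jobs on its workers then failed -- are a standard heartbeat/timeout pattern, which is also why a lost worker takes up to two minutes to show up as failed jobs. Below is a self-contained sketch of that pattern with hypothetical names; it is not the coaster channel code, just an illustration:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Heartbeat/timeout sketch (hypothetical names, not the coaster classes).
public class ChannelWatchdog {
    private static final long HEARTBEAT_MS = 60_000;   // heartbeats sent every minute
    private static final long TIMEOUT_MS   = 120_000;  // assume broken after 2 silent minutes

    private final Map<String, Long> lastTraffic = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Periodically look for channels with no traffic in the last 2 minutes.
        timer.scheduleAtFixedRate(this::check, HEARTBEAT_MS, HEARTBEAT_MS,
                                  TimeUnit.MILLISECONDS);
    }

    /** Call on every packet received on a channel, heartbeats included. */
    public void recordTraffic(String channelId) {
        lastTraffic.put(channelId, System.currentTimeMillis());
    }

    private void check() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastTraffic.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                lastTraffic.remove(e.getKey());
                onChannelDead(e.getKey());
            }
        }
    }

    /** In the real service this is where jobs on the lost workers would be failed. */
    protected void onChannelDead(String channelId) {
        System.err.println("channel " + channelId
            + " silent for 2 minutes; assuming it is broken");
    }
}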
> >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> Since the service seems to be alive (although a > >> jstack > >> > >> on the > >> > >> > > > >> service > >> > >> > > > >> > > > >> when thing seem to hang might help), this leaves > >> two > >> > >> > > > >> possibilities: > >> > >> > > > >> > > > >> 1 - some genuine network problem > >> > >> > > > >> > > > >> 2 - the worker died without properly closing TCP > >> > >> connections > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> If (2), you could enable worker logging > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = "DEBUG") to > >> see > >> > >> if > >> > >> > > > >> anything > >> > >> > > > >> > > shows > >> > >> > > > >> > > > >> up. > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> Mihael > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim Armstrong > >> wrote: > >> > >> > > > >> > > > >> > Here are client and service logs, with part of > >> > >> service log > >> > >> > > > >> edited > >> > >> > > > >> > > down > >> > >> > > > >> > > > >> to > >> > >> > > > >> > > > >> > be a reasonable size (I have the full thing if > >> > >> needed, but > >> > >> > > it > >> > >> > > > >> was > >> > >> > > > >> > > over a > >> > >> > > > >> > > > >> > gigabyte). > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 onwards. > >> The > >> > >> client > >> > >> > > > >> submits 4 > >> > >> > > > >> > > > >> jobs > >> > >> > > > >> > > > >> > (its limit), but they don't complete until > >> 19:51:32 > >> > >> or so > >> > >> > > (I > >> > >> > > > >> can see > >> > >> > > > >> > > > >> that > >> > >> > > > >> > > > >> > one task completed based on ncompleted=1 in the > >> > >> > > check_tasks log > >> > >> > > > >> > > > >> message). > >> > >> > > > >> > > > >> > It looks like something has happened with broken > >> > >> pipes and > >> > >> > > > >> workers > >> > >> > > > >> > > being > >> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate cause of > >> > >> that is > >> > >> > > > >> likely to > >> > >> > > > >> > > be. > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > - Tim > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael Hategan < > >> > >> > > > >> hategan at mcs.anl.gov > >> > >> > > > >> > > > > >> > >> > > > >> > > > >> wrote: > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > Hi Tim, > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > I've never seen this before with pure Java. > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > Mihael > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim > >> Armstrong > >> > >> wrote: > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script that submit > >> > >> tasks to > >> > >> > > > >> Coasters > >> > >> > > > >> > > > >> through > >> > >> > > > >> > > > >> > > the > >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd behaviour > >> > >> where task > >> > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 > >> minute > >> > >> periods. 
> >> > >> > > > >> For > >> > >> > > > >> > > > >> example, I'm > >> > >> > > > >> > > > >> > > > seeing submit log messages like "submitting > >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > >> > >> /bin/hostname" in > >> > >> > > > >> bursts of > >> > >> > > > >> > > > >> several > >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes in > >> > >> between, > >> > >> > > e.g. > >> > >> > > > >> I'm > >> > >> > > > >> > > seeing > >> > >> > > > >> > > > >> > > bursts > >> > >> > > > >> > > > >> > > > with the following intervals in my logs. > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is on the > >> coaster > >> > >> > > service > >> > >> > > > >> side: > >> > >> > > > >> > > the > >> > >> > > > >> > > > >> C > >> > >> > > > >> > > > >> > > > client is just waiting for a response. > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted through the > >> > >> local job > >> > >> > > > >> > > manager, so > >> > >> > > > >> > > > >> I > >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. The tasks > >> are > >> > >> also > >> > >> > > just > >> > >> > > > >> > > > >> > > "/bin/hostname", > >> > >> > > > >> > > > >> > > > so should return immediately. > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into this on my > >> > >> own, but > >> > >> > > the > >> > >> > > > >> 2 > >> > >> > > > >> > > minute > >> > >> > > > >> > > > >> delay > >> > >> > > > >> > > > >> > > > seems like a big clue: does anyone have an > >> idea > >> > >> what > >> > >> > > could > >> > >> > > > >> cause > >> > >> > > > >> > > > >> stalls > >> > >> > > > >> > > > >> > > in > >> > >> > > > >> > > > >> > > > task submission of 2 minute duration? > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > Cheers, > >> > >> > > > >> > > > >> > > > Tim > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > >> > >> > >> > > > >> > > > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > >> > >> > > > >> > >> > >> > > > >> > >> > >> > > > > > >> > >> > > > >> > >> > > > >> > >> > > > >> > >> > >> > >> > >> > >> > >> > > > >> > >> > >> > > From tim.g.armstrong at gmail.com Thu Sep 11 13:26:52 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 13:26:52 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410459811.27191.5.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> Message-ID: Oops, forgot about that On Thu, Sep 11, 2014 at 1:23 PM, Mihael Hategan wrote: > The coaster logging was broken, and that brokenness caused it to print > everything on stdout. That got fixed, so the actual log is now > in ./cps*.log. > > So I probably need that log. 
> > Mihael > > On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote: > > I meant the github master., but it turns out that I had had the wrong > Swift > > on my path. Apologies for the confusion. > > > > I've rerun with the current one. > > > > I'm getting a null pointer exception on line 226 of > > BlockQueueProcessor.java. Adding some printfs revealed that settings was > > null. > > > > Log attached. > > > > - Tim > > > > Job: Job(id:0 600.000s) > > Settings: null > > java.lang.NullPointerException > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) > > at > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) > > at > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) > > at > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) > > at > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) > > at > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) > > at > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) > > > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong < > tim.g.armstrong at gmail.com> > > wrote: > > > > > I thought I was running the latest trunk, I'll rebuild and see if I can > > > reproduce the issue. > > > > > > - Tim > > > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan > > > wrote: > > > > > >> The method "getMetaChannel()" has been removed. Where did you get the > > >> code from? > > >> > > >> Mihael > > >> > > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > >> > I'm seeing failures when running Swift/T tests with > > >> > start-coaster-service.sh. > > >> > > > >> > E.g. the turbine test coaster-exec-1. I can provide instructions > for > > >> > running the test if needed (roughly, you need to build Swift/T with > > >> coaster > > >> > support enabled, then make tests/coaster-exec-1.result in the > turbine > > >> > directory). The github swift-t release is up to date if you want > to use > > >> > that. > > >> > > > >> > Full log is attached, stack trace excerpt is below. > > >> > > > >> > - Tim > > >> > > > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
> > >> > id=0911-1112130 > > >> > Using threaded sender for TCPChannel [type: server, contact: > > >> 127.0.0.1:48242 > > >> > ] > > >> > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel > Using > > >> > threaded sender for TCPChannel [type: server, contact: > 127.0.0.1:48242] > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: > null > > >> > @id://null-nullS > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > >> > at > > >> > > > >> > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > >> > at > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > >> > provider=local > > >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler > provider=local > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: > null > > >> > @id://null-nullS > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > >> > at > > >> > > > >> > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > >> > at > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize > job > > >> > description > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > >> > description > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > >> > at > > >> > > > >> > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > >> > at > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > >> > Caused by: 
org.globus.cog.coaster.channels.ChannelException: Invalid > > >> > channel: null at id://null-nullS > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > >> > ... 4 more > > >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, > > >> > SUBMITJOB) sending error: Could not deserialize job description > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > >> > description > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > >> > at > > >> > > > >> > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > >> > at > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > >> > channel: null at id://null-nullS > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > >> > at > > >> > > > >> > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > >> > at > > >> > > > >> > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > >> > ... 4 more > > >> > > > >> > > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > > >> tim.g.armstrong at gmail.com> > > >> > wrote: > > >> > > > >> > > This all sounds great. > > >> > > > > >> > > Just to check that I've understood correctly, from the client's > point > > >> of > > >> > > view: > > >> > > * The per-client settings behave the same if -shared is not > provided. > > >> > > * Per-client settings are ignored if -shared is provided > > >> > > > > >> > > I had one question: > > >> > > * Do automatically allocated workers work with per-client > settings? I > > >> > > understand there were some issues related to sharing workers > between > > >> > > clients. Was the solution to have separate worker pools, or is > this > > >> just > > >> > > not supported? > > >> > > > > >> > > - Tim > > >> > > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan < > hategan at mcs.anl.gov> > > >> > > wrote: > > >> > > > > >> > >> So... > > >> > >> > > >> > >> There were bugs. Lots of bugs. > > >> > >> I did some work over the weekend to fix some of these and clean > up > > >> the > > >> > >> coaster code. Here's a summary: > > >> > >> > > >> > >> - there was some stuff in the low level coaster code to deal with > > >> > >> persisting coaster channels over multiple connections with > various > > >> > >> options, like periodic connections, client or server initiated > > >> > >> connections, buffering of commands, etc. 
None of this was used by > > >> Swift, > > >> > >> and the code was pretty messy. I removed that. > > >> > >> - there were some issues with multiple clients: > > >> > >> * improper shutdown of relevant workers when a client > disconnected > > >> > >> * the worker task dispatcher was a singleton and had a > reference to > > >> > >> one block allocator, whereas multiple clients involved multiple > > >> > >> allocators. > > >> > >> - there were a bunch of locking issues in the C client that > valgrind > > >> > >> caught > > >> > >> - the idea of remote job ids was a bit hard to work with. This > > >> remote id > > >> > >> was the job id that the service assigned to a job. This is > necessary > > >> > >> because two different clients can submit jobs with the same id. > The > > >> > >> remote id would be communicated to the client as the reply to the > > >> submit > > >> > >> request. However, it was entirely possible for a notification > about > > >> job > > >> > >> status to be sent to the client before the submit reply was. > Since > > >> > >> notifications were sent using the remote-id, the client would > have no > > >> > >> idea what job the notifications belonged to. Now, the server > might > > >> need > > >> > >> a unique job id, but there is no reason why it cannot use the > client > > >> id > > >> > >> when communicating the status to a client. So that's there now. > > >> > >> - the way the C client was working, its jobs ended up not going > to > > >> the > > >> > >> workers, but the local queue. The service settings now allow > > >> specifying > > >> > >> the provider/jobManager/url to be used to start blocks, and jobs > are > > >> > >> routed appropriately if they do not have the batch job flag set. > > >> > >> > > >> > >> I also added a shared service mode. We discussed this before. > > >> Basically > > >> > >> you start the coaster service with "-shared > " and > > >> > >> all the settings are read from that file. In this case, all > clients > > >> > >> share the same worker pool, and client settings are ignored. > > >> > >> > > >> > >> The C client now has a multi-job testing tool which can submit > many > > >> jobs > > >> > >> with the desired level of concurrency. > > >> > >> > > >> > >> I have tested the C client with both shared and non-shared mode, > with > > >> > >> various levels of jobs being sent, with either one or two > concurrent > > >> > >> clients. > > >> > >> > > >> > >> I haven't tested manual workers. > > >> > >> > > >> > >> I've also decided that during normal operation (i.e. client > connects, > > >> > >> submits jobs, shuts down gracefully), there should be no > exceptions > > >> in > > >> > >> the coaster log. I think we should stick to that principle. This > was > > >> the > > >> > >> case last I tested, and we should consider any deviation from > that > > >> to be > > >> > >> a problem. Of course, there are some things for which there is no > > >> > >> graceful shut down, such as ctrl+C-ing a manual worker. > Exceptions > > >> are > > >> > >> fine in that case. > > >> > >> > > >> > >> So anyway, let's start from here. > > >> > >> > > >> > >> Mihael > > >> > >> > > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > >> > >> > Thanks, let me know if there's anything I can help do. > > >> > >> > > > >> > >> > - Tim > > >> > >> > > > >> > >> > > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > > >> hategan at mcs.anl.gov> > > >> > >> wrote: > > >> > >> > > > >> > >> > > Thanks. 
It also seems that there is an older bug in there in > > >> which the > > >> > >> > > client connection is not properly accounted for and things > start > > >> > >> failing > > >> > >> > > two minutes after the client connects (which is also probably > > >> why you > > >> > >> > > didn't see this in runs with many short client connections). > I'm > > >> not > > >> > >> > > sure why the fix for that bug isn't in the trunk code. > > >> > >> > > > > >> > >> > > In any event, I'll set up a client submission loop and fix > all > > >> these > > >> > >> > > things. > > >> > >> > > > > >> > >> > > Mihael > > >> > >> > > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > >> > >> > > > Ok, here it is with the additional debug messages. Source > code > > >> > >> change is > > >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > >> > >> > > > > > >> > >> > > > Warning: the tarball will expand to several gigabytes of > logs. > > >> > >> > > > > > >> > >> > > > I had to do multiple client runs to trigger it. It seems > like > > >> the > > >> > >> > > problem > > >> > >> > > > might be triggered by abnormal termination of the client. > > >> First 18 > > >> > >> runs > > >> > >> > > > went fine, problem only started when I ctrl-c-ed the > swift/t > > >> run #19 > > >> > >> > > before > > >> > >> > > > the run #20 that exhibited delays. > > >> > >> > > > > > >> > >> > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > >> > >> > > > > > >> > >> > > > - Tim > > >> > >> > > > > > >> > >> > > > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > >> > >> tim.g.armstrong at gmail.com > > >> > >> > > > > > >> > >> > > > wrote: > > >> > >> > > > > > >> > >> > > > > It's here: > > >> > >> > > > > > > >> > >> > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > > >> . > > >> > >> > > > > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ > client > > >> and > > >> > >> see > > >> > >> > > if I > > >> > >> > > > > can recreate the scenario. > > >> > >> > > > > > > >> > >> > > > > - Tim > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > >> > >> hategan at mcs.anl.gov> > > >> > >> > > > > wrote: > > >> > >> > > > > > > >> > >> > > > >> Ok, so that's legit. > > >> > >> > > > >> > > >> > >> > > > >> It does look like shut down workers are not properly > > >> accounted > > >> > >> for in > > >> > >> > > > >> some places (and I believe Yadu submitted a bug for > this). > > >> > >> However, I > > >> > >> > > do > > >> > >> > > > >> not see the dead time you mention in either of the last > two > > >> sets > > >> > >> of > > >> > >> > > > >> logs. It looks like each client instance submits a > continous > > >> > >> stream of > > >> > >> > > > >> jobs. > > >> > >> > > > >> > > >> > >> > > > >> So let's get back to the initial log. Can I have the > full > > >> > >> service log? > > >> > >> > > > >> I'm trying to track what happened with the jobs > submitted > > >> before > > >> > >> the > > >> > >> > > > >> first big pause. > > >> > >> > > > >> > > >> > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() > (or > > >> > >> friends) > > >> > >> > > > >> would probably help a lot here. 
> > >> > >> > > > >> > > >> > >> > > > >> Mihael > > >> > >> > > > >> > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > >> > >> > > > >> > Should be here: > > >> > >> > > > >> > > > >> > >> > > > >> > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > > >> > >> hategan at mcs.anl.gov > > >> > >> > > > > > >> > >> > > > >> wrote: > > >> > >> > > > >> > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. The log > > >> looks > > >> > >> funny > > >> > >> > > at > > >> > >> > > > >> the > > >> > >> > > > >> > > end. > > >> > >> > > > >> > > > > >> > >> > > > >> > > Can you git pull and re-run? The worker is getting > some > > >> > >> command > > >> > >> > > at the > > >> > >> > > > >> > > end there and doing nothing about it and I wonder > why. > > >> > >> > > > >> > > > > >> > >> > > > >> > > Mihael > > >> > >> > > > >> > > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong > wrote: > > >> > >> > > > >> > > > Ok, now I have some worker logs: > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker logs > that > > >> would > > >> > >> > > > >> indicate why > > >> > >> > > > >> > > > the connection was broken. > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > - Tim > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > >> > >> > > > >> tim.g.armstrong at gmail.com > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > wrote: > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > This is all running locally on my laptop, so I > > >> think we > > >> > >> can > > >> > >> > > rule > > >> > >> > > > >> out > > >> > >> > > > >> > > 1). > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster > service > > >> gets > > >> > >> into > > >> > >> > > > >> after a > > >> > >> > > > >> > > few > > >> > >> > > > >> > > > > client sessions: generally the first coaster run > > >> works > > >> > >> fine, > > >> > >> > > then > > >> > >> > > > >> > > after a > > >> > >> > > > >> > > > > few runs the problem occurs more frequently. > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, in the > > >> meantime > > >> > >> i've got > > >> > >> > > > >> some > > >> > >> > > > >> > > > > jstacks (attached). > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if > needed: > > >> > >> > > > >> > > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan > < > > >> > >> > > > >> hategan at mcs.anl.gov> > > >> > >> > > > >> > > > > wrote: > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > >> Ah, makes sense. > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. 
Each live > > >> connection > > >> > >> is > > >> > >> > > > >> guaranteed > > >> > >> > > > >> > > to > > >> > >> > > > >> > > > >> have some communication for any 2 minute time > > >> window, > > >> > >> > > partially > > >> > >> > > > >> due to > > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If > no > > >> > >> packets flow > > >> > >> > > > >> for the > > >> > >> > > > >> > > > >> duration of 2 minutes, the connection is > assumed > > >> broken > > >> > >> and > > >> > >> > > all > > >> > >> > > > >> jobs > > >> > >> > > > >> > > > >> that were submitted to the respective workers > are > > >> > >> considered > > >> > >> > > > >> failed. > > >> > >> > > > >> > > So > > >> > >> > > > >> > > > >> there seems to be an issue with the > connections to > > >> some > > >> > >> of > > >> > >> > > the > > >> > >> > > > >> > > workers, > > >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> Since the service seems to be alive (although a > > >> jstack > > >> > >> on the > > >> > >> > > > >> service > > >> > >> > > > >> > > > >> when thing seem to hang might help), this > leaves > > >> two > > >> > >> > > > >> possibilities: > > >> > >> > > > >> > > > >> 1 - some genuine network problem > > >> > >> > > > >> > > > >> 2 - the worker died without properly closing > TCP > > >> > >> connections > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> If (2), you could enable worker logging > > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = > "DEBUG") to > > >> see > > >> > >> if > > >> > >> > > > >> anything > > >> > >> > > > >> > > shows > > >> > >> > > > >> > > > >> up. > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> Mihael > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim > Armstrong > > >> wrote: > > >> > >> > > > >> > > > >> > Here are client and service logs, with part > of > > >> > >> service log > > >> > >> > > > >> edited > > >> > >> > > > >> > > down > > >> > >> > > > >> > > > >> to > > >> > >> > > > >> > > > >> > be a reasonable size (I have the full thing > if > > >> > >> needed, but > > >> > >> > > it > > >> > >> > > > >> was > > >> > >> > > > >> > > over a > > >> > >> > > > >> > > > >> > gigabyte). > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 > onwards. > > >> The > > >> > >> client > > >> > >> > > > >> submits 4 > > >> > >> > > > >> > > > >> jobs > > >> > >> > > > >> > > > >> > (its limit), but they don't complete until > > >> 19:51:32 > > >> > >> or so > > >> > >> > > (I > > >> > >> > > > >> can see > > >> > >> > > > >> > > > >> that > > >> > >> > > > >> > > > >> > one task completed based on ncompleted=1 in > the > > >> > >> > > check_tasks log > > >> > >> > > > >> > > > >> message). > > >> > >> > > > >> > > > >> > It looks like something has happened with > broken > > >> > >> pipes and > > >> > >> > > > >> workers > > >> > >> > > > >> > > being > > >> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate > cause of > > >> > >> that is > > >> > >> > > > >> likely to > > >> > >> > > > >> > > be. 
> > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > - Tim > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael > Hategan < > > >> > >> > > > >> hategan at mcs.anl.gov > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > >> wrote: > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > Hi Tim, > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > I've never seen this before with pure Java. > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > Mihael > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim > > >> Armstrong > > >> > >> wrote: > > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script that > submit > > >> > >> tasks to > > >> > >> > > > >> Coasters > > >> > >> > > > >> > > > >> through > > >> > >> > > > >> > > > >> > > the > > >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd > behaviour > > >> > >> where task > > >> > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 > > >> minute > > >> > >> periods. > > >> > >> > > > >> For > > >> > >> > > > >> > > > >> example, I'm > > >> > >> > > > >> > > > >> > > > seeing submit log messages like > "submitting > > >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > > >> > >> /bin/hostname" in > > >> > >> > > > >> bursts of > > >> > >> > > > >> > > > >> several > > >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes > in > > >> > >> between, > > >> > >> > > e.g. > > >> > >> > > > >> I'm > > >> > >> > > > >> > > seeing > > >> > >> > > > >> > > > >> > > bursts > > >> > >> > > > >> > > > >> > > > with the following intervals in my logs. > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is on the > > >> coaster > > >> > >> > > service > > >> > >> > > > >> side: > > >> > >> > > > >> > > the > > >> > >> > > > >> > > > >> C > > >> > >> > > > >> > > > >> > > > client is just waiting for a response. > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted > through the > > >> > >> local job > > >> > >> > > > >> > > manager, so > > >> > >> > > > >> > > > >> I > > >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. The > tasks > > >> are > > >> > >> also > > >> > >> > > just > > >> > >> > > > >> > > > >> > > "/bin/hostname", > > >> > >> > > > >> > > > >> > > > so should return immediately. 
> > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into this > on my > > >> > >> own, but > > >> > >> > > the > > >> > >> > > > >> 2 > > >> > >> > > > >> > > minute > > >> > >> > > > >> > > > >> delay > > >> > >> > > > >> > > > >> > > > seems like a big clue: does anyone have > an > > >> idea > > >> > >> what > > >> > >> > > could > > >> > >> > > > >> cause > > >> > >> > > > >> > > > >> stalls > > >> > >> > > > >> > > > >> > > in > > >> > >> > > > >> > > > >> > > > task submission of 2 minute duration? > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > Cheers, > > >> > >> > > > >> > > > >> > > > Tim > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > >> > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > >> > > >> > >> > > > > > > >> > >> > > > > >> > >> > > > > >> > >> > > > > >> > >> > > >> > >> > > >> > >> > > >> > > > > >> > > >> > > >> > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: cps-2014-09-11_13-09-49.log.gz Type: application/x-gzip Size: 3906 bytes Desc: not available URL: From hategan at mcs.anl.gov Thu Sep 11 13:44:28 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 11:44:28 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> Message-ID: <1410461068.27191.7.camel@echo> Passive workers? On Thu, 2014-09-11 at 13:26 -0500, Tim Armstrong wrote: > Oops, forgot about that > > > > On Thu, Sep 11, 2014 at 1:23 PM, Mihael Hategan wrote: > > > The coaster logging was broken, and that brokenness caused it to print > > everything on stdout. That got fixed, so the actual log is now > > in ./cps*.log. > > > > So I probably need that log. > > > > Mihael > > > > On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote: > > > I meant the github master., but it turns out that I had had the wrong > > Swift > > > on my path. Apologies for the confusion. > > > > > > I've rerun with the current one. > > > > > > I'm getting a null pointer exception on line 226 of > > > BlockQueueProcessor.java. Adding some printfs revealed that settings was > > > null. > > > > > > Log attached. 
> > > > > > - Tim > > > > > > Job: Job(id:0 600.000s) > > > Settings: null > > > java.lang.NullPointerException > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) > > > at > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) > > > at > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) > > > at > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) > > > at > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) > > > at > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) > > > at > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) > > > > > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong < > > tim.g.armstrong at gmail.com> > > > wrote: > > > > > > > I thought I was running the latest trunk, I'll rebuild and see if I can > > > > reproduce the issue. > > > > > > > > - Tim > > > > > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan > > > > wrote: > > > > > > > >> The method "getMetaChannel()" has been removed. Where did you get the > > > >> code from? > > > >> > > > >> Mihael > > > >> > > > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > > >> > I'm seeing failures when running Swift/T tests with > > > >> > start-coaster-service.sh. > > > >> > > > > >> > E.g. the turbine test coaster-exec-1. I can provide instructions > > for > > > >> > running the test if needed (roughly, you need to build Swift/T with > > > >> coaster > > > >> > support enabled, then make tests/coaster-exec-1.result in the > > turbine > > > >> > directory). The github swift-t release is up to date if you want > > to use > > > >> > that. > > > >> > > > > >> > Full log is attached, stack trace excerpt is below. > > > >> > > > > >> > - Tim > > > >> > > > > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor Starting... 
> > > >> > id=0911-1112130 > > > >> > Using threaded sender for TCPChannel [type: server, contact: > > > >> 127.0.0.1:48242 > > > >> > ] > > > >> > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel > > Using > > > >> > threaded sender for TCPChannel [type: server, contact: > > 127.0.0.1:48242] > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: > > null > > > >> > @id://null-nullS > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > >> > at > > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > >> > provider=local > > > >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler > > provider=local > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid channel: > > null > > > >> > @id://null-nullS > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > >> > at > > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not deserialize > > job > > > >> > description > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > > >> > description > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > >> > at > > > >> > > > > >> > > 
org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > >> > at > > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > > >> > channel: null at id://null-nullS > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > >> > ... 4 more > > > >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: 38907, > > > >> > SUBMITJOB) sending error: Could not deserialize job description > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize job > > > >> > description > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > >> > at > > > >> org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: Invalid > > > >> > channel: null at id://null-nullS > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > >> > at > > > >> > > > > >> > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > >> > at > > > >> > > > > >> > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > >> > ... 4 more > > > >> > > > > >> > > > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > > > >> tim.g.armstrong at gmail.com> > > > >> > wrote: > > > >> > > > > >> > > This all sounds great. > > > >> > > > > > >> > > Just to check that I've understood correctly, from the client's > > point > > > >> of > > > >> > > view: > > > >> > > * The per-client settings behave the same if -shared is not > > provided. > > > >> > > * Per-client settings are ignored if -shared is provided > > > >> > > > > > >> > > I had one question: > > > >> > > * Do automatically allocated workers work with per-client > > settings? I > > > >> > > understand there were some issues related to sharing workers > > between > > > >> > > clients. Was the solution to have separate worker pools, or is > > this > > > >> just > > > >> > > not supported? > > > >> > > > > > >> > > - Tim > > > >> > > > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan < > > hategan at mcs.anl.gov> > > > >> > > wrote: > > > >> > > > > > >> > >> So... > > > >> > >> > > > >> > >> There were bugs. Lots of bugs. 
> > > >> > >> I did some work over the weekend to fix some of these and clean > > up > > > >> the > > > >> > >> coaster code. Here's a summary: > > > >> > >> > > > >> > >> - there was some stuff in the low level coaster code to deal with > > > >> > >> persisting coaster channels over multiple connections with > > various > > > >> > >> options, like periodic connections, client or server initiated > > > >> > >> connections, buffering of commands, etc. None of this was used by > > > >> Swift, > > > >> > >> and the code was pretty messy. I removed that. > > > >> > >> - there were some issues with multiple clients: > > > >> > >> * improper shutdown of relevant workers when a client > > disconnected > > > >> > >> * the worker task dispatcher was a singleton and had a > > reference to > > > >> > >> one block allocator, whereas multiple clients involved multiple > > > >> > >> allocators. > > > >> > >> - there were a bunch of locking issues in the C client that > > valgrind > > > >> > >> caught > > > >> > >> - the idea of remote job ids was a bit hard to work with. This > > > >> remote id > > > >> > >> was the job id that the service assigned to a job. This is > > necessary > > > >> > >> because two different clients can submit jobs with the same id. > > The > > > >> > >> remote id would be communicated to the client as the reply to the > > > >> submit > > > >> > >> request. However, it was entirely possible for a notification > > about > > > >> job > > > >> > >> status to be sent to the client before the submit reply was. > > Since > > > >> > >> notifications were sent using the remote-id, the client would > > have no > > > >> > >> idea what job the notifications belonged to. Now, the server > > might > > > >> need > > > >> > >> a unique job id, but there is no reason why it cannot use the > > client > > > >> id > > > >> > >> when communicating the status to a client. So that's there now. > > > >> > >> - the way the C client was working, its jobs ended up not going > > to > > > >> the > > > >> > >> workers, but the local queue. The service settings now allow > > > >> specifying > > > >> > >> the provider/jobManager/url to be used to start blocks, and jobs > > are > > > >> > >> routed appropriately if they do not have the batch job flag set. > > > >> > >> > > > >> > >> I also added a shared service mode. We discussed this before. > > > >> Basically > > > >> > >> you start the coaster service with "-shared > > " and > > > >> > >> all the settings are read from that file. In this case, all > > clients > > > >> > >> share the same worker pool, and client settings are ignored. > > > >> > >> > > > >> > >> The C client now has a multi-job testing tool which can submit > > many > > > >> jobs > > > >> > >> with the desired level of concurrency. > > > >> > >> > > > >> > >> I have tested the C client with both shared and non-shared mode, > > with > > > >> > >> various levels of jobs being sent, with either one or two > > concurrent > > > >> > >> clients. > > > >> > >> > > > >> > >> I haven't tested manual workers. > > > >> > >> > > > >> > >> I've also decided that during normal operation (i.e. client > > connects, > > > >> > >> submits jobs, shuts down gracefully), there should be no > > exceptions > > > >> in > > > >> > >> the coaster log. I think we should stick to that principle. This > > was > > > >> the > > > >> > >> case last I tested, and we should consider any deviation from > > that > > > >> to be > > > >> > >> a problem. 
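
A minimal sketch of the client-id scheme described in the summary quoted above (hypothetical names, not the actual cog/coaster classes): because the client records each job under its own id before the submit request goes out, a status notification keyed by that id can always be matched, even if it overtakes the submit reply -- which is exactly the race the server-assigned remote-id scheme ran into.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical client-side registry; not the actual cog/coaster classes.
class ClientJobRegistry {
    private final Map<String, String> statusByClientId = new ConcurrentHashMap<>();

    // Register the job under its client-assigned id before the submit
    // request is sent, so the id is known locally before any reply or
    // notification can possibly arrive.
    void registerSubmitted(String clientJobId) {
        statusByClientId.put(clientJobId, "SUBMITTED");
    }

    // Notifications keyed by the client id (as described in the summary
    // above) always match an entry; a server-assigned "remote id" could
    // not be resolved until the submit reply had been processed.
    void onStatusNotification(String clientJobId, String status) {
        statusByClientId.replace(clientJobId, status);
    }

    String statusOf(String clientJobId) {
        return statusByClientId.get(clientJobId);
    }
}
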
Of course, there are some things for which there is no > > > >> > >> graceful shut down, such as ctrl+C-ing a manual worker. > > Exceptions > > > >> are > > > >> > >> fine in that case. > > > >> > >> > > > >> > >> So anyway, let's start from here. > > > >> > >> > > > >> > >> Mihael > > > >> > >> > > > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > > >> > >> > Thanks, let me know if there's anything I can help do. > > > >> > >> > > > > >> > >> > - Tim > > > >> > >> > > > > >> > >> > > > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > > > >> hategan at mcs.anl.gov> > > > >> > >> wrote: > > > >> > >> > > > > >> > >> > > Thanks. It also seems that there is an older bug in there in > > > >> which the > > > >> > >> > > client connection is not properly accounted for and things > > start > > > >> > >> failing > > > >> > >> > > two minutes after the client connects (which is also probably > > > >> why you > > > >> > >> > > didn't see this in runs with many short client connections). > > I'm > > > >> not > > > >> > >> > > sure why the fix for that bug isn't in the trunk code. > > > >> > >> > > > > > >> > >> > > In any event, I'll set up a client submission loop and fix > > all > > > >> these > > > >> > >> > > things. > > > >> > >> > > > > > >> > >> > > Mihael > > > >> > >> > > > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > > >> > >> > > > Ok, here it is with the additional debug messages. Source > > code > > > >> > >> change is > > > >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > >> > >> > > > > > > >> > >> > > > Warning: the tarball will expand to several gigabytes of > > logs. > > > >> > >> > > > > > > >> > >> > > > I had to do multiple client runs to trigger it. It seems > > like > > > >> the > > > >> > >> > > problem > > > >> > >> > > > might be triggered by abnormal termination of the client. > > > >> First 18 > > > >> > >> runs > > > >> > >> > > > went fine, problem only started when I ctrl-c-ed the > > swift/t > > > >> run #19 > > > >> > >> > > before > > > >> > >> > > > the run #20 that exhibited delays. > > > >> > >> > > > > > > >> > >> > > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > >> > >> > > > > > > >> > >> > > > - Tim > > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > > >> > >> tim.g.armstrong at gmail.com > > > >> > >> > > > > > > >> > >> > > > wrote: > > > >> > >> > > > > > > >> > >> > > > > It's here: > > > >> > >> > > > > > > > >> > >> > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > > > >> . > > > >> > >> > > > > > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ > > client > > > >> and > > > >> > >> see > > > >> > >> > > if I > > > >> > >> > > > > can recreate the scenario. > > > >> > >> > > > > > > > >> > >> > > > > - Tim > > > >> > >> > > > > > > > >> > >> > > > > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > > >> > >> hategan at mcs.anl.gov> > > > >> > >> > > > > wrote: > > > >> > >> > > > > > > > >> > >> > > > >> Ok, so that's legit. > > > >> > >> > > > >> > > > >> > >> > > > >> It does look like shut down workers are not properly > > > >> accounted > > > >> > >> for in > > > >> > >> > > > >> some places (and I believe Yadu submitted a bug for > > this). 
> > > >> > >> However, I > > > >> > >> > > do > > > >> > >> > > > >> not see the dead time you mention in either of the last > > two > > > >> sets > > > >> > >> of > > > >> > >> > > > >> logs. It looks like each client instance submits a > > continous > > > >> > >> stream of > > > >> > >> > > > >> jobs. > > > >> > >> > > > >> > > > >> > >> > > > >> So let's get back to the initial log. Can I have the > > full > > > >> > >> service log? > > > >> > >> > > > >> I'm trying to track what happened with the jobs > > submitted > > > >> before > > > >> > >> the > > > >> > >> > > > >> first big pause. > > > >> > >> > > > >> > > > >> > >> > > > >> Also, a log message in CoasterClient::updateJobStatus() > > (or > > > >> > >> friends) > > > >> > >> > > > >> would probably help a lot here. > > > >> > >> > > > >> > > > >> > >> > > > >> Mihael > > > >> > >> > > > >> > > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong wrote: > > > >> > >> > > > >> > Should be here: > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > > > >> > >> hategan at mcs.anl.gov > > > >> > >> > > > > > > >> > >> > > > >> wrote: > > > >> > >> > > > >> > > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. The log > > > >> looks > > > >> > >> funny > > > >> > >> > > at > > > >> > >> > > > >> the > > > >> > >> > > > >> > > end. > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > Can you git pull and re-run? The worker is getting > > some > > > >> > >> command > > > >> > >> > > at the > > > >> > >> > > > >> > > end there and doing nothing about it and I wonder > > why. > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > Mihael > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong > > wrote: > > > >> > >> > > > >> > > > Ok, now I have some worker logs: > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker logs > > that > > > >> would > > > >> > >> > > > >> indicate why > > > >> > >> > > > >> > > > the connection was broken. > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > - Tim > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong < > > > >> > >> > > > >> tim.g.armstrong at gmail.com > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > wrote: > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > This is all running locally on my laptop, so I > > > >> think we > > > >> > >> can > > > >> > >> > > rule > > > >> > >> > > > >> out > > > >> > >> > > > >> > > 1). > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster > > service > > > >> gets > > > >> > >> into > > > >> > >> > > > >> after a > > > >> > >> > > > >> > > few > > > >> > >> > > > >> > > > > client sessions: generally the first coaster run > > > >> works > > > >> > >> fine, > > > >> > >> > > then > > > >> > >> > > > >> > > after a > > > >> > >> > > > >> > > > > few runs the problem occurs more frequently. 
> > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, in the > > > >> meantime > > > >> > >> i've got > > > >> > >> > > > >> some > > > >> > >> > > > >> > > > > jstacks (attached). > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if > > needed: > > > >> > >> > > > >> > > > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael Hategan > > < > > > >> > >> > > > >> hategan at mcs.anl.gov> > > > >> > >> > > > >> > > > > wrote: > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> Ah, makes sense. > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live > > > >> connection > > > >> > >> is > > > >> > >> > > > >> guaranteed > > > >> > >> > > > >> > > to > > > >> > >> > > > >> > > > >> have some communication for any 2 minute time > > > >> window, > > > >> > >> > > partially > > > >> > >> > > > >> due to > > > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). If > > no > > > >> > >> packets flow > > > >> > >> > > > >> for the > > > >> > >> > > > >> > > > >> duration of 2 minutes, the connection is > > assumed > > > >> broken > > > >> > >> and > > > >> > >> > > all > > > >> > >> > > > >> jobs > > > >> > >> > > > >> > > > >> that were submitted to the respective workers > > are > > > >> > >> considered > > > >> > >> > > > >> failed. > > > >> > >> > > > >> > > So > > > >> > >> > > > >> > > > >> there seems to be an issue with the > > connections to > > > >> some > > > >> > >> of > > > >> > >> > > the > > > >> > >> > > > >> > > workers, > > > >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> Since the service seems to be alive (although a > > > >> jstack > > > >> > >> on the > > > >> > >> > > > >> service > > > >> > >> > > > >> > > > >> when thing seem to hang might help), this > > leaves > > > >> two > > > >> > >> > > > >> possibilities: > > > >> > >> > > > >> > > > >> 1 - some genuine network problem > > > >> > >> > > > >> > > > >> 2 - the worker died without properly closing > > TCP > > > >> > >> connections > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> If (2), you could enable worker logging > > > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = > > "DEBUG") to > > > >> see > > > >> > >> if > > > >> > >> > > > >> anything > > > >> > >> > > > >> > > shows > > > >> > >> > > > >> > > > >> up. > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> Mihael > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim > > Armstrong > > > >> wrote: > > > >> > >> > > > >> > > > >> > Here are client and service logs, with part > > of > > > >> > >> service log > > > >> > >> > > > >> edited > > > >> > >> > > > >> > > down > > > >> > >> > > > >> > > > >> to > > > >> > >> > > > >> > > > >> > be a reasonable size (I have the full thing > > if > > > >> > >> needed, but > > > >> > >> > > it > > > >> > >> > > > >> was > > > >> > >> > > > >> > > over a > > > >> > >> > > > >> > > > >> > gigabyte). > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 > > onwards. 
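
A rough sketch of the 2-minute detection logic described in the quoted exchange above (hypothetical names and structure; the real coaster channel code differs): every packet, including the one-minute heartbeats, refreshes a last-seen timestamp, and a periodic check declares the connection broken and fails its jobs once nothing has arrived for two minutes.

import java.util.concurrent.TimeUnit;

// Hypothetical watchdog; the actual coaster channel code is more involved.
class ChannelWatchdog {
    private static final long TIMEOUT_NANOS = TimeUnit.MINUTES.toNanos(2);

    private volatile long lastPacketNanos = System.nanoTime();
    private volatile boolean broken = false;

    // Called for every packet received on the connection, including the
    // heartbeats sent every minute, so a healthy channel never times out.
    void packetReceived() {
        lastPacketNanos = System.nanoTime();
    }

    // Called periodically; once nothing has been seen for 2 minutes the
    // connection is assumed broken and the jobs that were submitted to
    // its workers are treated as failed.
    void check() {
        if (!broken && System.nanoTime() - lastPacketNanos > TIMEOUT_NANOS) {
            broken = true;
            failOutstandingJobs("no traffic for 2 minutes, assuming connection is broken");
        }
    }

    private void failOutstandingJobs(String reason) {
        // Placeholder: the real service marks every job on the affected
        // workers as failed and logs the reason.
        System.err.println("Channel timed out: " + reason);
    }
}
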
> > > >> The > > > >> > >> client > > > >> > >> > > > >> submits 4 > > > >> > >> > > > >> > > > >> jobs > > > >> > >> > > > >> > > > >> > (its limit), but they don't complete until > > > >> 19:51:32 > > > >> > >> or so > > > >> > >> > > (I > > > >> > >> > > > >> can see > > > >> > >> > > > >> > > > >> that > > > >> > >> > > > >> > > > >> > one task completed based on ncompleted=1 in > > the > > > >> > >> > > check_tasks log > > > >> > >> > > > >> > > > >> message). > > > >> > >> > > > >> > > > >> > It looks like something has happened with > > broken > > > >> > >> pipes and > > > >> > >> > > > >> workers > > > >> > >> > > > >> > > being > > > >> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate > > cause of > > > >> > >> that is > > > >> > >> > > > >> likely to > > > >> > >> > > > >> > > be. > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > - Tim > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael > > Hategan < > > > >> > >> > > > >> hategan at mcs.anl.gov > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > >> wrote: > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > Hi Tim, > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > I've never seen this before with pure Java. > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > Mihael > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim > > > >> Armstrong > > > >> > >> wrote: > > > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script that > > submit > > > >> > >> tasks to > > > >> > >> > > > >> Coasters > > > >> > >> > > > >> > > > >> through > > > >> > >> > > > >> > > > >> > > the > > > >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd > > behaviour > > > >> > >> where task > > > >> > >> > > > >> > > > >> > > > submission/execution is stalling for ~2 > > > >> minute > > > >> > >> periods. > > > >> > >> > > > >> For > > > >> > >> > > > >> > > > >> example, I'm > > > >> > >> > > > >> > > > >> > > > seeing submit log messages like > > "submitting > > > >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > > > >> > >> /bin/hostname" in > > > >> > >> > > > >> bursts of > > > >> > >> > > > >> > > > >> several > > > >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 minutes > > in > > > >> > >> between, > > > >> > >> > > e.g. > > > >> > >> > > > >> I'm > > > >> > >> > > > >> > > seeing > > > >> > >> > > > >> > > > >> > > bursts > > > >> > >> > > > >> > > > >> > > > with the following intervals in my logs. > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is on the > > > >> coaster > > > >> > >> > > service > > > >> > >> > > > >> side: > > > >> > >> > > > >> > > the > > > >> > >> > > > >> > > > >> C > > > >> > >> > > > >> > > > >> > > > client is just waiting for a response. 
> > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted > > through the > > > >> > >> local job > > > >> > >> > > > >> > > manager, so > > > >> > >> > > > >> > > > >> I > > > >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. The > > tasks > > > >> are > > > >> > >> also > > > >> > >> > > just > > > >> > >> > > > >> > > > >> > > "/bin/hostname", > > > >> > >> > > > >> > > > >> > > > so should return immediately. > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into this > > on my > > > >> > >> own, but > > > >> > >> > > the > > > >> > >> > > > >> 2 > > > >> > >> > > > >> > > minute > > > >> > >> > > > >> > > > >> delay > > > >> > >> > > > >> > > > >> > > > seems like a big clue: does anyone have > > an > > > >> idea > > > >> > >> what > > > >> > >> > > could > > > >> > >> > > > >> cause > > > >> > >> > > > >> > > > >> stalls > > > >> > >> > > > >> > > > >> > > in > > > >> > >> > > > >> > > > >> > > > task submission of 2 minute duration? > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > Cheers, > > > >> > >> > > > >> > > > >> > > > Tim > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > >> > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > >> > > > >> > >> > > > > > > > >> > >> > > > > > >> > >> > > > > > >> > >> > > > > > >> > >> > > > >> > >> > > > >> > >> > > > >> > > > > > >> > > > >> > > > >> > > > > > > > > > > From tim.g.armstrong at gmail.com Thu Sep 11 13:54:21 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 13:54:21 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410461068.27191.7.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> Message-ID: Yeah, local workers, this was started with start-coaster-service with conf file: export WORKER_MODE=local export IPADDR=127.0.0.1 export SERVICE_PORT=53363 export JOBSPERNODE=4 export LOGDIR=$(pwd) - Tim On Thu, Sep 11, 2014 at 1:44 PM, Mihael Hategan wrote: > Passive workers? > > On Thu, 2014-09-11 at 13:26 -0500, Tim Armstrong wrote: > > Oops, forgot about that > > > > > > > > On Thu, Sep 11, 2014 at 1:23 PM, Mihael Hategan > wrote: > > > > > The coaster logging was broken, and that brokenness caused it to print > > > everything on stdout. That got fixed, so the actual log is now > > > in ./cps*.log. > > > > > > So I probably need that log. > > > > > > Mihael > > > > > > On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote: > > > > I meant the github master., but it turns out that I had had the wrong > > > Swift > > > > on my path. Apologies for the confusion. > > > > > > > > I've rerun with the current one. > > > > > > > > I'm getting a null pointer exception on line 226 of > > > > BlockQueueProcessor.java. Adding some printfs revealed that > settings was > > > > null. > > > > > > > > Log attached. 
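
For illustration only, a fail-fast guard of the kind the printf debugging quoted above points at (hypothetical names, not the actual BlockQueueProcessor code): it converts the bare NullPointerException into an error that names the piece of state that was never attached.

import java.util.Objects;

// Hypothetical fail-fast guard; not the actual BlockQueueProcessor code.
class QueueProcessorSketch {
    private Object settings; // stands in for the per-queue Settings object

    void checkJob(Object job) {
        // Replace the bare NullPointerException reported above with an
        // error that says which piece of state was missing.
        Objects.requireNonNull(settings,
                "no Settings attached to this queue processor before checkJob()");
        // ... per-job checks against the settings would follow here ...
    }
}
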
> > > > > > > > - Tim > > > > > > > > Job: Job(id:0 600.000s) > > > > Settings: null > > > > java.lang.NullPointerException > > > > at > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) > > > > at > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) > > > > at > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) > > > > at > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) > > > > at > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) > > > > at > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) > > > > at > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) > > > > at > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) > > > > at > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) > > > > > > > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong < > > > tim.g.armstrong at gmail.com> > > > > wrote: > > > > > > > > > I thought I was running the latest trunk, I'll rebuild and see if > I can > > > > > reproduce the issue. > > > > > > > > > > - Tim > > > > > > > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan < > hategan at mcs.anl.gov> > > > > > wrote: > > > > > > > > > >> The method "getMetaChannel()" has been removed. Where did you get > the > > > > >> code from? > > > > >> > > > > >> Mihael > > > > >> > > > > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > > > >> > I'm seeing failures when running Swift/T tests with > > > > >> > start-coaster-service.sh. > > > > >> > > > > > >> > E.g. the turbine test coaster-exec-1. I can provide > instructions > > > for > > > > >> > running the test if needed (roughly, you need to build Swift/T > with > > > > >> coaster > > > > >> > support enabled, then make tests/coaster-exec-1.result in the > > > turbine > > > > >> > directory). The github swift-t release is up to date if you > want > > > to use > > > > >> > that. > > > > >> > > > > > >> > Full log is attached, stack trace excerpt is below. > > > > >> > > > > > >> > - Tim > > > > >> > > > > > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor > Starting... 
> > > > >> > id=0911-1112130 > > > > >> > Using threaded sender for TCPChannel [type: server, contact: > > > > >> 127.0.0.1:48242 > > > > >> > ] > > > > >> > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel > > > Using > > > > >> > threaded sender for TCPChannel [type: server, contact: > > > 127.0.0.1:48242] > > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid > channel: > > > null > > > > >> > @id://null-nullS > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > >> > at > > > > >> > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > >> > provider=local > > > > >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler > > > provider=local > > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid > channel: > > > null > > > > >> > @id://null-nullS > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > >> > at > > > > >> > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not > deserialize > > > job > > > > >> > description > > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize > job > > > > >> > description > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > > >> > at > > > > >> > > > > > >> > > > > 
org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > >> > at > > > > >> > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: > Invalid > > > > >> > channel: null at id://null-nullS > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > >> > ... 4 more > > > > >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: > 38907, > > > > >> > SUBMITJOB) sending error: Could not deserialize job description > > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize > job > > > > >> > description > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > >> > at > > > > >> > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: > Invalid > > > > >> > channel: null at id://null-nullS > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > >> > at > > > > >> > > > > > >> > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > >> > ... 4 more > > > > >> > > > > > >> > > > > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > > > > >> tim.g.armstrong at gmail.com> > > > > >> > wrote: > > > > >> > > > > > >> > > This all sounds great. > > > > >> > > > > > > >> > > Just to check that I've understood correctly, from the > client's > > > point > > > > >> of > > > > >> > > view: > > > > >> > > * The per-client settings behave the same if -shared is not > > > provided. > > > > >> > > * Per-client settings are ignored if -shared is provided > > > > >> > > > > > > >> > > I had one question: > > > > >> > > * Do automatically allocated workers work with per-client > > > settings? 
I > > > > >> > > understand there were some issues related to sharing workers > > > between > > > > >> > > clients. Was the solution to have separate worker pools, or > is > > > this > > > > >> just > > > > >> > > not supported? > > > > >> > > > > > > >> > > - Tim > > > > >> > > > > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan < > > > hategan at mcs.anl.gov> > > > > >> > > wrote: > > > > >> > > > > > > >> > >> So... > > > > >> > >> > > > > >> > >> There were bugs. Lots of bugs. > > > > >> > >> I did some work over the weekend to fix some of these and > clean > > > up > > > > >> the > > > > >> > >> coaster code. Here's a summary: > > > > >> > >> > > > > >> > >> - there was some stuff in the low level coaster code to deal > with > > > > >> > >> persisting coaster channels over multiple connections with > > > various > > > > >> > >> options, like periodic connections, client or server > initiated > > > > >> > >> connections, buffering of commands, etc. None of this was > used by > > > > >> Swift, > > > > >> > >> and the code was pretty messy. I removed that. > > > > >> > >> - there were some issues with multiple clients: > > > > >> > >> * improper shutdown of relevant workers when a client > > > disconnected > > > > >> > >> * the worker task dispatcher was a singleton and had a > > > reference to > > > > >> > >> one block allocator, whereas multiple clients involved > multiple > > > > >> > >> allocators. > > > > >> > >> - there were a bunch of locking issues in the C client that > > > valgrind > > > > >> > >> caught > > > > >> > >> - the idea of remote job ids was a bit hard to work with. > This > > > > >> remote id > > > > >> > >> was the job id that the service assigned to a job. This is > > > necessary > > > > >> > >> because two different clients can submit jobs with the same > id. > > > The > > > > >> > >> remote id would be communicated to the client as the reply > to the > > > > >> submit > > > > >> > >> request. However, it was entirely possible for a notification > > > about > > > > >> job > > > > >> > >> status to be sent to the client before the submit reply was. > > > Since > > > > >> > >> notifications were sent using the remote-id, the client would > > > have no > > > > >> > >> idea what job the notifications belonged to. Now, the server > > > might > > > > >> need > > > > >> > >> a unique job id, but there is no reason why it cannot use the > > > client > > > > >> id > > > > >> > >> when communicating the status to a client. So that's there > now. > > > > >> > >> - the way the C client was working, its jobs ended up not > going > > > to > > > > >> the > > > > >> > >> workers, but the local queue. The service settings now allow > > > > >> specifying > > > > >> > >> the provider/jobManager/url to be used to start blocks, and > jobs > > > are > > > > >> > >> routed appropriately if they do not have the batch job flag > set. > > > > >> > >> > > > > >> > >> I also added a shared service mode. We discussed this before. > > > > >> Basically > > > > >> > >> you start the coaster service with "-shared > > > " and > > > > >> > >> all the settings are read from that file. In this case, all > > > clients > > > > >> > >> share the same worker pool, and client settings are ignored. > > > > >> > >> > > > > >> > >> The C client now has a multi-job testing tool which can > submit > > > many > > > > >> jobs > > > > >> > >> with the desired level of concurrency. 
> > > > >> > >> > > > > >> > >> I have tested the C client with both shared and non-shared > mode, > > > with > > > > >> > >> various levels of jobs being sent, with either one or two > > > concurrent > > > > >> > >> clients. > > > > >> > >> > > > > >> > >> I haven't tested manual workers. > > > > >> > >> > > > > >> > >> I've also decided that during normal operation (i.e. client > > > connects, > > > > >> > >> submits jobs, shuts down gracefully), there should be no > > > exceptions > > > > >> in > > > > >> > >> the coaster log. I think we should stick to that principle. > This > > > was > > > > >> the > > > > >> > >> case last I tested, and we should consider any deviation from > > > that > > > > >> to be > > > > >> > >> a problem. Of course, there are some things for which there > is no > > > > >> > >> graceful shut down, such as ctrl+C-ing a manual worker. > > > Exceptions > > > > >> are > > > > >> > >> fine in that case. > > > > >> > >> > > > > >> > >> So anyway, let's start from here. > > > > >> > >> > > > > >> > >> Mihael > > > > >> > >> > > > > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > > > >> > >> > Thanks, let me know if there's anything I can help do. > > > > >> > >> > > > > > >> > >> > - Tim > > > > >> > >> > > > > > >> > >> > > > > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > > > > >> hategan at mcs.anl.gov> > > > > >> > >> wrote: > > > > >> > >> > > > > > >> > >> > > Thanks. It also seems that there is an older bug in > there in > > > > >> which the > > > > >> > >> > > client connection is not properly accounted for and > things > > > start > > > > >> > >> failing > > > > >> > >> > > two minutes after the client connects (which is also > probably > > > > >> why you > > > > >> > >> > > didn't see this in runs with many short client > connections). > > > I'm > > > > >> not > > > > >> > >> > > sure why the fix for that bug isn't in the trunk code. > > > > >> > >> > > > > > > >> > >> > > In any event, I'll set up a client submission loop and > fix > > > all > > > > >> these > > > > >> > >> > > things. > > > > >> > >> > > > > > > >> > >> > > Mihael > > > > >> > >> > > > > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > > > >> > >> > > > Ok, here it is with the additional debug messages. > Source > > > code > > > > >> > >> change is > > > > >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > > >> > >> > > > > > > > >> > >> > > > Warning: the tarball will expand to several gigabytes > of > > > logs. > > > > >> > >> > > > > > > > >> > >> > > > I had to do multiple client runs to trigger it. It > seems > > > like > > > > >> the > > > > >> > >> > > problem > > > > >> > >> > > > might be triggered by abnormal termination of the > client. > > > > >> First 18 > > > > >> > >> runs > > > > >> > >> > > > went fine, problem only started when I ctrl-c-ed the > > > swift/t > > > > >> run #19 > > > > >> > >> > > before > > > > >> > >> > > > the run #20 that exhibited delays. 
> > > > >> > >> > > > > > > > >> > >> > > > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > > >> > >> > > > > > > > >> > >> > > > - Tim > > > > >> > >> > > > > > > > >> > >> > > > > > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > > > >> > >> tim.g.armstrong at gmail.com > > > > >> > >> > > > > > > > >> > >> > > > wrote: > > > > >> > >> > > > > > > > >> > >> > > > > It's here: > > > > >> > >> > > > > > > > > >> > >> > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > > > > >> . > > > > >> > >> > > > > > > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ > > > client > > > > >> and > > > > >> > >> see > > > > >> > >> > > if I > > > > >> > >> > > > > can recreate the scenario. > > > > >> > >> > > > > > > > > >> > >> > > > > - Tim > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > > > >> > >> hategan at mcs.anl.gov> > > > > >> > >> > > > > wrote: > > > > >> > >> > > > > > > > > >> > >> > > > >> Ok, so that's legit. > > > > >> > >> > > > >> > > > > >> > >> > > > >> It does look like shut down workers are not properly > > > > >> accounted > > > > >> > >> for in > > > > >> > >> > > > >> some places (and I believe Yadu submitted a bug for > > > this). > > > > >> > >> However, I > > > > >> > >> > > do > > > > >> > >> > > > >> not see the dead time you mention in either of the > last > > > two > > > > >> sets > > > > >> > >> of > > > > >> > >> > > > >> logs. It looks like each client instance submits a > > > continous > > > > >> > >> stream of > > > > >> > >> > > > >> jobs. > > > > >> > >> > > > >> > > > > >> > >> > > > >> So let's get back to the initial log. Can I have the > > > full > > > > >> > >> service log? > > > > >> > >> > > > >> I'm trying to track what happened with the jobs > > > submitted > > > > >> before > > > > >> > >> the > > > > >> > >> > > > >> first big pause. > > > > >> > >> > > > >> > > > > >> > >> > > > >> Also, a log message in > CoasterClient::updateJobStatus() > > > (or > > > > >> > >> friends) > > > > >> > >> > > > >> would probably help a lot here. > > > > >> > >> > > > >> > > > > >> > >> > > > >> Mihael > > > > >> > >> > > > >> > > > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong > wrote: > > > > >> > >> > > > >> > Should be here: > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > > > > >> > >> hategan at mcs.anl.gov > > > > >> > >> > > > > > > > >> > >> > > > >> wrote: > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. > The log > > > > >> looks > > > > >> > >> funny > > > > >> > >> > > at > > > > >> > >> > > > >> the > > > > >> > >> > > > >> > > end. > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > Can you git pull and re-run? The worker is > getting > > > some > > > > >> > >> command > > > > >> > >> > > at the > > > > >> > >> > > > >> > > end there and doing nothing about it and I > wonder > > > why. 
> > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > Mihael > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong > > > wrote: > > > > >> > >> > > > >> > > > Ok, now I have some worker logs: > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker > logs > > > that > > > > >> would > > > > >> > >> > > > >> indicate why > > > > >> > >> > > > >> > > > the connection was broken. > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > - Tim > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong > < > > > > >> > >> > > > >> tim.g.armstrong at gmail.com > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > wrote: > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > This is all running locally on my laptop, > so I > > > > >> think we > > > > >> > >> can > > > > >> > >> > > rule > > > > >> > >> > > > >> out > > > > >> > >> > > > >> > > 1). > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster > > > service > > > > >> gets > > > > >> > >> into > > > > >> > >> > > > >> after a > > > > >> > >> > > > >> > > few > > > > >> > >> > > > >> > > > > client sessions: generally the first > coaster run > > > > >> works > > > > >> > >> fine, > > > > >> > >> > > then > > > > >> > >> > > > >> > > after a > > > > >> > >> > > > >> > > > > few runs the problem occurs more frequently. > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, in the > > > > >> meantime > > > > >> > >> i've got > > > > >> > >> > > > >> some > > > > >> > >> > > > >> > > > > jstacks (attached). > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if > > > needed: > > > > >> > >> > > > >> > > > > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael > Hategan > > > < > > > > >> > >> > > > >> hategan at mcs.anl.gov> > > > > >> > >> > > > >> > > > > wrote: > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> Ah, makes sense. > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live > > > > >> connection > > > > >> > >> is > > > > >> > >> > > > >> guaranteed > > > > >> > >> > > > >> > > to > > > > >> > >> > > > >> > > > >> have some communication for any 2 minute > time > > > > >> window, > > > > >> > >> > > partially > > > > >> > >> > > > >> due to > > > > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). > If > > > no > > > > >> > >> packets flow > > > > >> > >> > > > >> for the > > > > >> > >> > > > >> > > > >> duration of 2 minutes, the connection is > > > assumed > > > > >> broken > > > > >> > >> and > > > > >> > >> > > all > > > > >> > >> > > > >> jobs > > > > >> > >> > > > >> > > > >> that were submitted to the respective > workers > > > are > > > > >> > >> considered > > > > >> > >> > > > >> failed. 
> > > > >> > >> > > > >> > > So > > > > >> > >> > > > >> > > > >> there seems to be an issue with the > > > connections to > > > > >> some > > > > >> > >> of > > > > >> > >> > > the > > > > >> > >> > > > >> > > workers, > > > > >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> Since the service seems to be alive > (although a > > > > >> jstack > > > > >> > >> on the > > > > >> > >> > > > >> service > > > > >> > >> > > > >> > > > >> when thing seem to hang might help), this > > > leaves > > > > >> two > > > > >> > >> > > > >> possibilities: > > > > >> > >> > > > >> > > > >> 1 - some genuine network problem > > > > >> > >> > > > >> > > > >> 2 - the worker died without properly > closing > > > TCP > > > > >> > >> connections > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> If (2), you could enable worker logging > > > > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = > > > "DEBUG") to > > > > >> see > > > > >> > >> if > > > > >> > >> > > > >> anything > > > > >> > >> > > > >> > > shows > > > > >> > >> > > > >> > > > >> up. > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> Mihael > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim > > > Armstrong > > > > >> wrote: > > > > >> > >> > > > >> > > > >> > Here are client and service logs, with > part > > > of > > > > >> > >> service log > > > > >> > >> > > > >> edited > > > > >> > >> > > > >> > > down > > > > >> > >> > > > >> > > > >> to > > > > >> > >> > > > >> > > > >> > be a reasonable size (I have the full > thing > > > if > > > > >> > >> needed, but > > > > >> > >> > > it > > > > >> > >> > > > >> was > > > > >> > >> > > > >> > > over a > > > > >> > >> > > > >> > > > >> > gigabyte). > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 > > > onwards. > > > > >> The > > > > >> > >> client > > > > >> > >> > > > >> submits 4 > > > > >> > >> > > > >> > > > >> jobs > > > > >> > >> > > > >> > > > >> > (its limit), but they don't complete > until > > > > >> 19:51:32 > > > > >> > >> or so > > > > >> > >> > > (I > > > > >> > >> > > > >> can see > > > > >> > >> > > > >> > > > >> that > > > > >> > >> > > > >> > > > >> > one task completed based on ncompleted=1 > in > > > the > > > > >> > >> > > check_tasks log > > > > >> > >> > > > >> > > > >> message). > > > > >> > >> > > > >> > > > >> > It looks like something has happened with > > > broken > > > > >> > >> pipes and > > > > >> > >> > > > >> workers > > > > >> > >> > > > >> > > being > > > > >> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate > > > cause of > > > > >> > >> that is > > > > >> > >> > > > >> likely to > > > > >> > >> > > > >> > > be. > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > - Tim > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael > > > Hategan < > > > > >> > >> > > > >> hategan at mcs.anl.gov > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> wrote: > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > Hi Tim, > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > I've never seen this before with pure > Java. > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? 
> > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > Mihael > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim > > > > >> Armstrong > > > > >> > >> wrote: > > > > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script > that > > > submit > > > > >> > >> tasks to > > > > >> > >> > > > >> Coasters > > > > >> > >> > > > >> > > > >> through > > > > >> > >> > > > >> > > > >> > > the > > > > >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd > > > behaviour > > > > >> > >> where task > > > > >> > >> > > > >> > > > >> > > > submission/execution is stalling for > ~2 > > > > >> minute > > > > >> > >> periods. > > > > >> > >> > > > >> For > > > > >> > >> > > > >> > > > >> example, I'm > > > > >> > >> > > > >> > > > >> > > > seeing submit log messages like > > > "submitting > > > > >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > > > > >> > >> /bin/hostname" in > > > > >> > >> > > > >> bursts of > > > > >> > >> > > > >> > > > >> several > > > > >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 > minutes > > > in > > > > >> > >> between, > > > > >> > >> > > e.g. > > > > >> > >> > > > >> I'm > > > > >> > >> > > > >> > > seeing > > > > >> > >> > > > >> > > > >> > > bursts > > > > >> > >> > > > >> > > > >> > > > with the following intervals in my > logs. > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is > on the > > > > >> coaster > > > > >> > >> > > service > > > > >> > >> > > > >> side: > > > > >> > >> > > > >> > > the > > > > >> > >> > > > >> > > > >> C > > > > >> > >> > > > >> > > > >> > > > client is just waiting for a > response. > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted > > > through the > > > > >> > >> local job > > > > >> > >> > > > >> > > manager, so > > > > >> > >> > > > >> > > > >> I > > > > >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. > The > > > tasks > > > > >> are > > > > >> > >> also > > > > >> > >> > > just > > > > >> > >> > > > >> > > > >> > > "/bin/hostname", > > > > >> > >> > > > >> > > > >> > > > so should return immediately. > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into > this > > > on my > > > > >> > >> own, but > > > > >> > >> > > the > > > > >> > >> > > > >> 2 > > > > >> > >> > > > >> > > minute > > > > >> > >> > > > >> > > > >> delay > > > > >> > >> > > > >> > > > >> > > > seems like a big clue: does anyone > have > > > an > > > > >> idea > > > > >> > >> what > > > > >> > >> > > could > > > > >> > >> > > > >> cause > > > > >> > >> > > > >> > > > >> stalls > > > > >> > >> > > > >> > > > >> > > in > > > > >> > >> > > > >> > > > >> > > > task submission of 2 minute duration? 
> > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > Cheers, > > > > >> > >> > > > >> > > > >> > > > Tim > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > >> > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > >> > > > > >> > >> > > > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > >> > >> > > > > >> > >> > > > > >> > > > > > > >> > > > > >> > > > > >> > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Thu Sep 11 13:58:28 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 11:58:28 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> Message-ID: <1410461908.31274.1.camel@echo> Can you try automatic workers? Passive workers is something I didn't quite get to yet. I believe that they should only be allowed in shared mode. Thoughts? Mihael On Thu, 2014-09-11 at 13:54 -0500, Tim Armstrong wrote: > Yeah, local workers, this was started with start-coaster-service with conf > file: > > export WORKER_MODE=local > export IPADDR=127.0.0.1 > export SERVICE_PORT=53363 > export JOBSPERNODE=4 > export LOGDIR=$(pwd) > > > - Tim > > On Thu, Sep 11, 2014 at 1:44 PM, Mihael Hategan wrote: > > > Passive workers? > > > > On Thu, 2014-09-11 at 13:26 -0500, Tim Armstrong wrote: > > > Oops, forgot about that > > > > > > > > > > > > On Thu, Sep 11, 2014 at 1:23 PM, Mihael Hategan > > wrote: > > > > > > > The coaster logging was broken, and that brokenness caused it to print > > > > everything on stdout. That got fixed, so the actual log is now > > > > in ./cps*.log. > > > > > > > > So I probably need that log. > > > > > > > > Mihael > > > > > > > > On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote: > > > > > I meant the github master., but it turns out that I had had the wrong > > > > Swift > > > > > on my path. Apologies for the confusion. > > > > > > > > > > I've rerun with the current one. > > > > > > > > > > I'm getting a null pointer exception on line 226 of > > > > > BlockQueueProcessor.java. Adding some printfs revealed that > > settings was > > > > > null. > > > > > > > > > > Log attached. 
> > > > > > > > > > - Tim > > > > > > > > > > Job: Job(id:0 600.000s) > > > > > Settings: null > > > > > java.lang.NullPointerException > > > > > at > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) > > > > > at > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) > > > > > at > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) > > > > > at > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) > > > > > at > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) > > > > > at > > > > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) > > > > > at > > > > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) > > > > > at > > > > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) > > > > > at > > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) > > > > > > > > > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong < > > > > tim.g.armstrong at gmail.com> > > > > > wrote: > > > > > > > > > > > I thought I was running the latest trunk, I'll rebuild and see if > > I can > > > > > > reproduce the issue. > > > > > > > > > > > > - Tim > > > > > > > > > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan < > > hategan at mcs.anl.gov> > > > > > > wrote: > > > > > > > > > > > >> The method "getMetaChannel()" has been removed. Where did you get > > the > > > > > >> code from? > > > > > >> > > > > > >> Mihael > > > > > >> > > > > > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > > > > >> > I'm seeing failures when running Swift/T tests with > > > > > >> > start-coaster-service.sh. > > > > > >> > > > > > > >> > E.g. the turbine test coaster-exec-1. I can provide > > instructions > > > > for > > > > > >> > running the test if needed (roughly, you need to build Swift/T > > with > > > > > >> coaster > > > > > >> > support enabled, then make tests/coaster-exec-1.result in the > > > > turbine > > > > > >> > directory). The github swift-t release is up to date if you > > want > > > > to use > > > > > >> > that. > > > > > >> > > > > > > >> > Full log is attached, stack trace excerpt is below. > > > > > >> > > > > > > >> > - Tim > > > > > >> > > > > > > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor > > Starting... 
> > > > > >> > id=0911-1112130 > > > > > >> > Using threaded sender for TCPChannel [type: server, contact: > > > > > >> 127.0.0.1:48242 > > > > > >> > ] > > > > > >> > 2014-09-11 12:11:13,708-0500 INFO AbstractStreamCoasterChannel > > > > Using > > > > > >> > threaded sender for TCPChannel [type: server, contact: > > > > 127.0.0.1:48242] > > > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid > > channel: > > > > null > > > > > >> > @id://null-nullS > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > >> > at > > > > > >> > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > >> > provider=local > > > > > >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler > > > > provider=local > > > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid > > channel: > > > > null > > > > > >> > @id://null-nullS > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > >> > at > > > > > >> > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > >> > Handler(tag: 38907, SUBMITJOB) sending error: Could not > > deserialize > > > > job > > > > > >> > description > > > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize > > job > > > > > >> > description > > > > > >> > at > > > > > >> > > > > > > 
>> > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > >> > at > > > > > >> > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: > > Invalid > > > > > >> > channel: null at id://null-nullS > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > > >> > ... 4 more > > > > > >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: > > 38907, > > > > > >> > SUBMITJOB) sending error: Could not deserialize job description > > > > > >> > org.globus.cog.coaster.ProtocolException: Could not deserialize > > job > > > > > >> > description > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > >> > at > > > > > >> > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: > > Invalid > > > > > >> > channel: null at id://null-nullS > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > > >> > at > > > > > >> > > > > > > >> > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > > >> > ... 4 more > > > > > >> > > > > > > >> > > > > > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > > > > > >> tim.g.armstrong at gmail.com> > > > > > >> > wrote: > > > > > >> > > > > > > >> > > This all sounds great. 
> > > > > >> > > > > > > > >> > > Just to check that I've understood correctly, from the > > client's > > > > point > > > > > >> of > > > > > >> > > view: > > > > > >> > > * The per-client settings behave the same if -shared is not > > > > provided. > > > > > >> > > * Per-client settings are ignored if -shared is provided > > > > > >> > > > > > > > >> > > I had one question: > > > > > >> > > * Do automatically allocated workers work with per-client > > > > settings? I > > > > > >> > > understand there were some issues related to sharing workers > > > > between > > > > > >> > > clients. Was the solution to have separate worker pools, or > > is > > > > this > > > > > >> just > > > > > >> > > not supported? > > > > > >> > > > > > > > >> > > - Tim > > > > > >> > > > > > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan < > > > > hategan at mcs.anl.gov> > > > > > >> > > wrote: > > > > > >> > > > > > > > >> > >> So... > > > > > >> > >> > > > > > >> > >> There were bugs. Lots of bugs. > > > > > >> > >> I did some work over the weekend to fix some of these and > > clean > > > > up > > > > > >> the > > > > > >> > >> coaster code. Here's a summary: > > > > > >> > >> > > > > > >> > >> - there was some stuff in the low level coaster code to deal > > with > > > > > >> > >> persisting coaster channels over multiple connections with > > > > various > > > > > >> > >> options, like periodic connections, client or server > > initiated > > > > > >> > >> connections, buffering of commands, etc. None of this was > > used by > > > > > >> Swift, > > > > > >> > >> and the code was pretty messy. I removed that. > > > > > >> > >> - there were some issues with multiple clients: > > > > > >> > >> * improper shutdown of relevant workers when a client > > > > disconnected > > > > > >> > >> * the worker task dispatcher was a singleton and had a > > > > reference to > > > > > >> > >> one block allocator, whereas multiple clients involved > > multiple > > > > > >> > >> allocators. > > > > > >> > >> - there were a bunch of locking issues in the C client that > > > > valgrind > > > > > >> > >> caught > > > > > >> > >> - the idea of remote job ids was a bit hard to work with. > > This > > > > > >> remote id > > > > > >> > >> was the job id that the service assigned to a job. This is > > > > necessary > > > > > >> > >> because two different clients can submit jobs with the same > > id. > > > > The > > > > > >> > >> remote id would be communicated to the client as the reply > > to the > > > > > >> submit > > > > > >> > >> request. However, it was entirely possible for a notification > > > > about > > > > > >> job > > > > > >> > >> status to be sent to the client before the submit reply was. > > > > Since > > > > > >> > >> notifications were sent using the remote-id, the client would > > > > have no > > > > > >> > >> idea what job the notifications belonged to. Now, the server > > > > might > > > > > >> need > > > > > >> > >> a unique job id, but there is no reason why it cannot use the > > > > client > > > > > >> id > > > > > >> > >> when communicating the status to a client. So that's there > > now. > > > > > >> > >> - the way the C client was working, its jobs ended up not > > going > > > > to > > > > > >> the > > > > > >> > >> workers, but the local queue. The service settings now allow > > > > > >> specifying > > > > > >> > >> the provider/jobManager/url to be used to start blocks, and > > jobs > > > > are > > > > > >> > >> routed appropriately if they do not have the batch job flag > > set. 
> > > > > >> > >> > > > > > >> > >> I also added a shared service mode. We discussed this before. > > > > > >> Basically > > > > > >> > >> you start the coaster service with "-shared > > > > " and > > > > > >> > >> all the settings are read from that file. In this case, all > > > > clients > > > > > >> > >> share the same worker pool, and client settings are ignored. > > > > > >> > >> > > > > > >> > >> The C client now has a multi-job testing tool which can > > submit > > > > many > > > > > >> jobs > > > > > >> > >> with the desired level of concurrency. > > > > > >> > >> > > > > > >> > >> I have tested the C client with both shared and non-shared > > mode, > > > > with > > > > > >> > >> various levels of jobs being sent, with either one or two > > > > concurrent > > > > > >> > >> clients. > > > > > >> > >> > > > > > >> > >> I haven't tested manual workers. > > > > > >> > >> > > > > > >> > >> I've also decided that during normal operation (i.e. client > > > > connects, > > > > > >> > >> submits jobs, shuts down gracefully), there should be no > > > > exceptions > > > > > >> in > > > > > >> > >> the coaster log. I think we should stick to that principle. > > This > > > > was > > > > > >> the > > > > > >> > >> case last I tested, and we should consider any deviation from > > > > that > > > > > >> to be > > > > > >> > >> a problem. Of course, there are some things for which there > > is no > > > > > >> > >> graceful shut down, such as ctrl+C-ing a manual worker. > > > > Exceptions > > > > > >> are > > > > > >> > >> fine in that case. > > > > > >> > >> > > > > > >> > >> So anyway, let's start from here. > > > > > >> > >> > > > > > >> > >> Mihael > > > > > >> > >> > > > > > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > > > > >> > >> > Thanks, let me know if there's anything I can help do. > > > > > >> > >> > > > > > > >> > >> > - Tim > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > > > > > >> hategan at mcs.anl.gov> > > > > > >> > >> wrote: > > > > > >> > >> > > > > > > >> > >> > > Thanks. It also seems that there is an older bug in > > there in > > > > > >> which the > > > > > >> > >> > > client connection is not properly accounted for and > > things > > > > start > > > > > >> > >> failing > > > > > >> > >> > > two minutes after the client connects (which is also > > probably > > > > > >> why you > > > > > >> > >> > > didn't see this in runs with many short client > > connections). > > > > I'm > > > > > >> not > > > > > >> > >> > > sure why the fix for that bug isn't in the trunk code. > > > > > >> > >> > > > > > > > >> > >> > > In any event, I'll set up a client submission loop and > > fix > > > > all > > > > > >> these > > > > > >> > >> > > things. > > > > > >> > >> > > > > > > > >> > >> > > Mihael > > > > > >> > >> > > > > > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong wrote: > > > > > >> > >> > > > Ok, here it is with the additional debug messages. > > Source > > > > code > > > > > >> > >> change is > > > > > >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > > > >> > >> > > > > > > > > >> > >> > > > Warning: the tarball will expand to several gigabytes > > of > > > > logs. > > > > > >> > >> > > > > > > > > >> > >> > > > I had to do multiple client runs to trigger it. It > > seems > > > > like > > > > > >> the > > > > > >> > >> > > problem > > > > > >> > >> > > > might be triggered by abnormal termination of the > > client. 
> > > > > >> First 18 > > > > > >> > >> runs > > > > > >> > >> > > > went fine, problem only started when I ctrl-c-ed the > > > > swift/t > > > > > >> run #19 > > > > > >> > >> > > before > > > > > >> > >> > > > the run #20 that exhibited delays. > > > > > >> > >> > > > > > > > > >> > >> > > > > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > > > >> > >> > > > > > > > > >> > >> > > > - Tim > > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > > > > >> > >> tim.g.armstrong at gmail.com > > > > > >> > >> > > > > > > > > >> > >> > > > wrote: > > > > > >> > >> > > > > > > > > >> > >> > > > > It's here: > > > > > >> > >> > > > > > > > > > >> > >> > > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > > > > > >> . > > > > > >> > >> > > > > > > > > > >> > >> > > > > I'll add some extra debug messages in the coaster C++ > > > > client > > > > > >> and > > > > > >> > >> see > > > > > >> > >> > > if I > > > > > >> > >> > > > > can recreate the scenario. > > > > > >> > >> > > > > > > > > > >> > >> > > > > - Tim > > > > > >> > >> > > > > > > > > > >> > >> > > > > > > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > > > > >> > >> hategan at mcs.anl.gov> > > > > > >> > >> > > > > wrote: > > > > > >> > >> > > > > > > > > > >> > >> > > > >> Ok, so that's legit. > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> It does look like shut down workers are not properly > > > > > >> accounted > > > > > >> > >> for in > > > > > >> > >> > > > >> some places (and I believe Yadu submitted a bug for > > > > this). > > > > > >> > >> However, I > > > > > >> > >> > > do > > > > > >> > >> > > > >> not see the dead time you mention in either of the > > last > > > > two > > > > > >> sets > > > > > >> > >> of > > > > > >> > >> > > > >> logs. It looks like each client instance submits a > > > > continous > > > > > >> > >> stream of > > > > > >> > >> > > > >> jobs. > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> So let's get back to the initial log. Can I have the > > > > full > > > > > >> > >> service log? > > > > > >> > >> > > > >> I'm trying to track what happened with the jobs > > > > submitted > > > > > >> before > > > > > >> > >> the > > > > > >> > >> > > > >> first big pause. > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> Also, a log message in > > CoasterClient::updateJobStatus() > > > > (or > > > > > >> > >> friends) > > > > > >> > >> > > > >> would probably help a lot here. > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> Mihael > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong > > wrote: > > > > > >> > >> > > > >> > Should be here: > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael Hategan < > > > > > >> > >> hategan at mcs.anl.gov > > > > > >> > >> > > > > > > > > >> > >> > > > >> wrote: > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. > > The log > > > > > >> looks > > > > > >> > >> funny > > > > > >> > >> > > at > > > > > >> > >> > > > >> the > > > > > >> > >> > > > >> > > end. > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > Can you git pull and re-run? 
The worker is > > getting > > > > some > > > > > >> > >> command > > > > > >> > >> > > at the > > > > > >> > >> > > > >> > > end there and doing nothing about it and I > > wonder > > > > why. > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > Mihael > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim Armstrong > > > > wrote: > > > > > >> > >> > > > >> > > > Ok, now I have some worker logs: > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > There's nothing obvious I see in the worker > > logs > > > > that > > > > > >> would > > > > > >> > >> > > > >> indicate why > > > > > >> > >> > > > >> > > > the connection was broken. > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > - Tim > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim Armstrong > > < > > > > > >> > >> > > > >> tim.g.armstrong at gmail.com > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > wrote: > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > This is all running locally on my laptop, > > so I > > > > > >> think we > > > > > >> > >> can > > > > > >> > >> > > rule > > > > > >> > >> > > > >> out > > > > > >> > >> > > > >> > > 1). > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > It also seems like it's a state the coaster > > > > service > > > > > >> gets > > > > > >> > >> into > > > > > >> > >> > > > >> after a > > > > > >> > >> > > > >> > > few > > > > > >> > >> > > > >> > > > > client sessions: generally the first > > coaster run > > > > > >> works > > > > > >> > >> fine, > > > > > >> > >> > > then > > > > > >> > >> > > > >> > > after a > > > > > >> > >> > > > >> > > > > few runs the problem occurs more frequently. > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, in the > > > > > >> meantime > > > > > >> > >> i've got > > > > > >> > >> > > > >> some > > > > > >> > >> > > > >> > > > > jstacks (attached). > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > Matching service logs (largish) are here if > > > > needed: > > > > > >> > >> > > > >> > > > > > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael > > Hategan > > > > < > > > > > >> > >> > > > >> hategan at mcs.anl.gov> > > > > > >> > >> > > > >> > > > > wrote: > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> Ah, makes sense. > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each live > > > > > >> connection > > > > > >> > >> is > > > > > >> > >> > > > >> guaranteed > > > > > >> > >> > > > >> > > to > > > > > >> > >> > > > >> > > > >> have some communication for any 2 minute > > time > > > > > >> window, > > > > > >> > >> > > partially > > > > > >> > >> > > > >> due to > > > > > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 minute). 
> > If > > > > no > > > > > >> > >> packets flow > > > > > >> > >> > > > >> for the > > > > > >> > >> > > > >> > > > >> duration of 2 minutes, the connection is > > > > assumed > > > > > >> broken > > > > > >> > >> and > > > > > >> > >> > > all > > > > > >> > >> > > > >> jobs > > > > > >> > >> > > > >> > > > >> that were submitted to the respective > > workers > > > > are > > > > > >> > >> considered > > > > > >> > >> > > > >> failed. > > > > > >> > >> > > > >> > > So > > > > > >> > >> > > > >> > > > >> there seems to be an issue with the > > > > connections to > > > > > >> some > > > > > >> > >> of > > > > > >> > >> > > the > > > > > >> > >> > > > >> > > workers, > > > > > >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> Since the service seems to be alive > > (although a > > > > > >> jstack > > > > > >> > >> on the > > > > > >> > >> > > > >> service > > > > > >> > >> > > > >> > > > >> when thing seem to hang might help), this > > > > leaves > > > > > >> two > > > > > >> > >> > > > >> possibilities: > > > > > >> > >> > > > >> > > > >> 1 - some genuine network problem > > > > > >> > >> > > > >> > > > >> 2 - the worker died without properly > > closing > > > > TCP > > > > > >> > >> connections > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> If (2), you could enable worker logging > > > > > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = > > > > "DEBUG") to > > > > > >> see > > > > > >> > >> if > > > > > >> > >> > > > >> anything > > > > > >> > >> > > > >> > > shows > > > > > >> > >> > > > >> > > > >> up. > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> Mihael > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim > > > > Armstrong > > > > > >> wrote: > > > > > >> > >> > > > >> > > > >> > Here are client and service logs, with > > part > > > > of > > > > > >> > >> service log > > > > > >> > >> > > > >> edited > > > > > >> > >> > > > >> > > down > > > > > >> > >> > > > >> > > > >> to > > > > > >> > >> > > > >> > > > >> > be a reasonable size (I have the full > > thing > > > > if > > > > > >> > >> needed, but > > > > > >> > >> > > it > > > > > >> > >> > > > >> was > > > > > >> > >> > > > >> > > over a > > > > > >> > >> > > > >> > > > >> > gigabyte). > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 > > > > onwards. > > > > > >> The > > > > > >> > >> client > > > > > >> > >> > > > >> submits 4 > > > > > >> > >> > > > >> > > > >> jobs > > > > > >> > >> > > > >> > > > >> > (its limit), but they don't complete > > until > > > > > >> 19:51:32 > > > > > >> > >> or so > > > > > >> > >> > > (I > > > > > >> > >> > > > >> can see > > > > > >> > >> > > > >> > > > >> that > > > > > >> > >> > > > >> > > > >> > one task completed based on ncompleted=1 > > in > > > > the > > > > > >> > >> > > check_tasks log > > > > > >> > >> > > > >> > > > >> message). > > > > > >> > >> > > > >> > > > >> > It looks like something has happened with > > > > broken > > > > > >> > >> pipes and > > > > > >> > >> > > > >> workers > > > > > >> > >> > > > >> > > being > > > > > >> > >> > > > >> > > > >> > lost, but I'm not sure what the ultimate > > > > cause of > > > > > >> > >> that is > > > > > >> > >> > > > >> likely to > > > > > >> > >> > > > >> > > be. 
> > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > - Tim > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, Mihael > > > > Hategan < > > > > > >> > >> > > > >> hategan at mcs.anl.gov > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> wrote: > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > Hi Tim, > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > I've never seen this before with pure > > Java. > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > Mihael > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, Tim > > > > > >> Armstrong > > > > > >> > >> wrote: > > > > > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script > > that > > > > submit > > > > > >> > >> tasks to > > > > > >> > >> > > > >> Coasters > > > > > >> > >> > > > >> > > > >> through > > > > > >> > >> > > > >> > > > >> > > the > > > > > >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some odd > > > > behaviour > > > > > >> > >> where task > > > > > >> > >> > > > >> > > > >> > > > submission/execution is stalling for > > ~2 > > > > > >> minute > > > > > >> > >> periods. > > > > > >> > >> > > > >> For > > > > > >> > >> > > > >> > > > >> example, I'm > > > > > >> > >> > > > >> > > > >> > > > seeing submit log messages like > > > > "submitting > > > > > >> > >> > > > >> > > > >> > > > urn:133-1409778135377-1409778135378: > > > > > >> > >> /bin/hostname" in > > > > > >> > >> > > > >> bursts of > > > > > >> > >> > > > >> > > > >> several > > > > > >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 > > minutes > > > > in > > > > > >> > >> between, > > > > > >> > >> > > e.g. > > > > > >> > >> > > > >> I'm > > > > > >> > >> > > > >> > > seeing > > > > > >> > >> > > > >> > > > >> > > bursts > > > > > >> > >> > > > >> > > > >> > > > with the following intervals in my > > logs. > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > > > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > > > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > > > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > > > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay is > > on the > > > > > >> coaster > > > > > >> > >> > > service > > > > > >> > >> > > > >> side: > > > > > >> > >> > > > >> > > the > > > > > >> > >> > > > >> > > > >> C > > > > > >> > >> > > > >> > > > >> > > > client is just waiting for a > > response. > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted > > > > through the > > > > > >> > >> local job > > > > > >> > >> > > > >> > > manager, so > > > > > >> > >> > > > >> > > > >> I > > > > > >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. > > The > > > > tasks > > > > > >> are > > > > > >> > >> also > > > > > >> > >> > > just > > > > > >> > >> > > > >> > > > >> > > "/bin/hostname", > > > > > >> > >> > > > >> > > > >> > > > so should return immediately. 
> > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging into > > this > > > > on my > > > > > >> > >> own, but > > > > > >> > >> > > the > > > > > >> > >> > > > >> 2 > > > > > >> > >> > > > >> > > minute > > > > > >> > >> > > > >> > > > >> delay > > > > > >> > >> > > > >> > > > >> > > > seems like a big clue: does anyone > > have > > > > an > > > > > >> idea > > > > > >> > >> what > > > > > >> > >> > > could > > > > > >> > >> > > > >> cause > > > > > >> > >> > > > >> > > > >> stalls > > > > > >> > >> > > > >> > > > >> > > in > > > > > >> > >> > > > >> > > > >> > > > task submission of 2 minute duration? > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > Cheers, > > > > > >> > >> > > > >> > > > >> > > > Tim > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > >> > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > >> > > > > > >> > >> > > > > > > > > > >> > >> > > > > > > > >> > >> > > > > > > > >> > >> > > > > > > > >> > >> > > > > > >> > >> > > > > > >> > >> > > > > > >> > > > > > > > >> > > > > > >> > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > From tim.g.armstrong at gmail.com Thu Sep 11 15:30:58 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Thu, 11 Sep 2014 15:30:58 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410461908.31274.1.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> Message-ID: I'll give automatic workers a try. If i That would make sense since the passive workers aren't per-client. - Tim On Thu, Sep 11, 2014 at 1:58 PM, Mihael Hategan wrote: > Can you try automatic workers? > > Passive workers is something I didn't quite get to yet. I believe that > they should only be allowed in shared mode. Thoughts? > > Mihael > > On Thu, 2014-09-11 at 13:54 -0500, Tim Armstrong wrote: > > Yeah, local workers, this was started with start-coaster-service with > conf > > file: > > > > export WORKER_MODE=local > > export IPADDR=127.0.0.1 > > export SERVICE_PORT=53363 > > export JOBSPERNODE=4 > > export LOGDIR=$(pwd) > > > > > > - Tim > > > > On Thu, Sep 11, 2014 at 1:44 PM, Mihael Hategan > wrote: > > > > > Passive workers? > > > > > > On Thu, 2014-09-11 at 13:26 -0500, Tim Armstrong wrote: > > > > Oops, forgot about that > > > > > > > > > > > > > > > > On Thu, Sep 11, 2014 at 1:23 PM, Mihael Hategan > > > > wrote: > > > > > > > > > The coaster logging was broken, and that brokenness caused it to > print > > > > > everything on stdout. That got fixed, so the actual log is now > > > > > in ./cps*.log. > > > > > > > > > > So I probably need that log. > > > > > > > > > > Mihael > > > > > > > > > > On Thu, 2014-09-11 at 13:10 -0500, Tim Armstrong wrote: > > > > > > I meant the github master., but it turns out that I had had the > wrong > > > > > Swift > > > > > > on my path. Apologies for the confusion. 
> > > > > > > > > > > > I've rerun with the current one. > > > > > > > > > > > > I'm getting a null pointer exception on line 226 of > > > > > > BlockQueueProcessor.java. Adding some printfs revealed that > > > settings was > > > > > > null. > > > > > > > > > > > > Log attached. > > > > > > > > > > > > - Tim > > > > > > > > > > > > Job: Job(id:0 600.000s) > > > > > > Settings: null > > > > > > java.lang.NullPointerException > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.checkJob(BlockQueueProcessor.java:228) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue1(BlockQueueProcessor.java:210) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.BlockQueueProcessor.enqueue(BlockQueueProcessor.java:204) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.enqueue(JobQueue.java:103) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:96) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:112) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:589) > > > > > > at > > > > > > > > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:175) > > > > > > at > > > > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:90) > > > > > > > > > > > > On Thu, Sep 11, 2014 at 12:41 PM, Tim Armstrong < > > > > > tim.g.armstrong at gmail.com> > > > > > > wrote: > > > > > > > > > > > > > I thought I was running the latest trunk, I'll rebuild and see > if > > > I can > > > > > > > reproduce the issue. > > > > > > > > > > > > > > - Tim > > > > > > > > > > > > > > On Thu, Sep 11, 2014 at 12:39 PM, Mihael Hategan < > > > hategan at mcs.anl.gov> > > > > > > > wrote: > > > > > > > > > > > > > >> The method "getMetaChannel()" has been removed. Where did you > get > > > the > > > > > > >> code from? > > > > > > >> > > > > > > >> Mihael > > > > > > >> > > > > > > >> On Thu, 2014-09-11 at 12:16 -0500, Tim Armstrong wrote: > > > > > > >> > I'm seeing failures when running Swift/T tests with > > > > > > >> > start-coaster-service.sh. > > > > > > >> > > > > > > > >> > E.g. the turbine test coaster-exec-1. I can provide > > > instructions > > > > > for > > > > > > >> > running the test if needed (roughly, you need to build > Swift/T > > > with > > > > > > >> coaster > > > > > > >> > support enabled, then make tests/coaster-exec-1.result in > the > > > > > turbine > > > > > > >> > directory). The github swift-t release is up to date if you > > > want > > > > > to use > > > > > > >> > that. > > > > > > >> > > > > > > > >> > Full log is attached, stack trace excerpt is below. > > > > > > >> > > > > > > > >> > - Tim > > > > > > >> > > > > > > > >> > 2014-09-11 12:11:13,708-0500 INFO BlockQueueProcessor > > > Starting... 
> > > > > > >> > id=0911-1112130 > > > > > > >> > Using threaded sender for TCPChannel [type: server, contact: > > > > > > >> 127.0.0.1:48242 > > > > > > >> > ] > > > > > > >> > 2014-09-11 12:11:13,708-0500 INFO > AbstractStreamCoasterChannel > > > > > Using > > > > > > >> > threaded sender for TCPChannel [type: server, contact: > > > > > 127.0.0.1:48242] > > > > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid > > > channel: > > > > > null > > > > > > >> > @id://null-nullS > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveChannel(ChannelManager.java:226) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.PassiveQueueProcessor.setClientChannelContext(PassiveQueueProcessor.java:41) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.job.manager.JobQueue.setClientChannelContext(JobQueue.java:135) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:77) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > > >> > at > > > > > > >> > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > > >> > provider=local > > > > > > >> > 2014-09-11 12:11:13,930-0500 INFO ExecutionTaskHandler > > > > > provider=local > > > > > > >> > org.globus.cog.coaster.channels.ChannelException: Invalid > > > channel: > > > > > null > > > > > > >> > @id://null-nullS > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > > >> > at > > > > > > >> > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > > >> > Handler(tag: 38907, SUBMITJOB) 
sending error: Could not > > > deserialize > > > > > job > > > > > > >> > description > > > > > > >> > org.globus.cog.coaster.ProtocolException: Could not > deserialize > > > job > > > > > > >> > description > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > > >> > at > > > > > > >> > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: > > > Invalid > > > > > > >> > channel: null at id://null-nullS > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > > > >> > ... 
4 more > > > > > > >> > 2014-09-11 12:11:13,937-0500 INFO RequestReply Handler(tag: > > > 38907, > > > > > > >> > SUBMITJOB) sending error: Could not deserialize job > description > > > > > > >> > org.globus.cog.coaster.ProtocolException: Could not > deserialize > > > job > > > > > > >> > description > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:84) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.handlers.RequestHandler.receiveCompleted(RequestHandler.java:88) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractCoasterChannel.handleRequest(AbstractCoasterChannel.java:527) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.AbstractStreamCoasterChannel.step(AbstractStreamCoasterChannel.java:173) > > > > > > >> > at > > > > > > >> > > > org.globus.cog.coaster.channels.Multiplexer.run(Multiplexer.java:70) > > > > > > >> > Caused by: org.globus.cog.coaster.channels.ChannelException: > > > Invalid > > > > > > >> > channel: null at id://null-nullS > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:452) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.getMetaChannel(ChannelManager.java:432) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.coaster.channels.ChannelManager.reserveLongTerm(ChannelManager.java:472) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > org.globus.cog.abstraction.coaster.service.SubmitJobHandler.requestComplete(SubmitJobHandler.java:80) > > > > > > >> > ... 4 more > > > > > > >> > > > > > > > >> > > > > > > > >> > On Thu, Sep 11, 2014 at 10:30 AM, Tim Armstrong < > > > > > > >> tim.g.armstrong at gmail.com> > > > > > > >> > wrote: > > > > > > >> > > > > > > > >> > > This all sounds great. > > > > > > >> > > > > > > > > >> > > Just to check that I've understood correctly, from the > > > client's > > > > > point > > > > > > >> of > > > > > > >> > > view: > > > > > > >> > > * The per-client settings behave the same if -shared is > not > > > > > provided. > > > > > > >> > > * Per-client settings are ignored if -shared is provided > > > > > > >> > > > > > > > > >> > > I had one question: > > > > > > >> > > * Do automatically allocated workers work with per-client > > > > > settings? I > > > > > > >> > > understand there were some issues related to sharing > workers > > > > > between > > > > > > >> > > clients. Was the solution to have separate worker pools, > or > > > is > > > > > this > > > > > > >> just > > > > > > >> > > not supported? > > > > > > >> > > > > > > > > >> > > - Tim > > > > > > >> > > > > > > > > >> > > On Mon, Sep 8, 2014 at 2:38 PM, Mihael Hategan < > > > > > hategan at mcs.anl.gov> > > > > > > >> > > wrote: > > > > > > >> > > > > > > > > >> > >> So... > > > > > > >> > >> > > > > > > >> > >> There were bugs. Lots of bugs. > > > > > > >> > >> I did some work over the weekend to fix some of these and > > > clean > > > > > up > > > > > > >> the > > > > > > >> > >> coaster code. 
Here's a summary: > > > > > > >> > >> > > > > > > >> > >> - there was some stuff in the low level coaster code to > deal > > > with > > > > > > >> > >> persisting coaster channels over multiple connections > with > > > > > various > > > > > > >> > >> options, like periodic connections, client or server > > > initiated > > > > > > >> > >> connections, buffering of commands, etc. None of this was > > > used by > > > > > > >> Swift, > > > > > > >> > >> and the code was pretty messy. I removed that. > > > > > > >> > >> - there were some issues with multiple clients: > > > > > > >> > >> * improper shutdown of relevant workers when a client > > > > > disconnected > > > > > > >> > >> * the worker task dispatcher was a singleton and had a > > > > > reference to > > > > > > >> > >> one block allocator, whereas multiple clients involved > > > multiple > > > > > > >> > >> allocators. > > > > > > >> > >> - there were a bunch of locking issues in the C client > that > > > > > valgrind > > > > > > >> > >> caught > > > > > > >> > >> - the idea of remote job ids was a bit hard to work with. > > > This > > > > > > >> remote id > > > > > > >> > >> was the job id that the service assigned to a job. This > is > > > > > necessary > > > > > > >> > >> because two different clients can submit jobs with the > same > > > id. > > > > > The > > > > > > >> > >> remote id would be communicated to the client as the > reply > > > to the > > > > > > >> submit > > > > > > >> > >> request. However, it was entirely possible for a > notification > > > > > about > > > > > > >> job > > > > > > >> > >> status to be sent to the client before the submit reply > was. > > > > > Since > > > > > > >> > >> notifications were sent using the remote-id, the client > would > > > > > have no > > > > > > >> > >> idea what job the notifications belonged to. Now, the > server > > > > > might > > > > > > >> need > > > > > > >> > >> a unique job id, but there is no reason why it cannot > use the > > > > > client > > > > > > >> id > > > > > > >> > >> when communicating the status to a client. So that's > there > > > now. > > > > > > >> > >> - the way the C client was working, its jobs ended up not > > > going > > > > > to > > > > > > >> the > > > > > > >> > >> workers, but the local queue. The service settings now > allow > > > > > > >> specifying > > > > > > >> > >> the provider/jobManager/url to be used to start blocks, > and > > > jobs > > > > > are > > > > > > >> > >> routed appropriately if they do not have the batch job > flag > > > set. > > > > > > >> > >> > > > > > > >> > >> I also added a shared service mode. We discussed this > before. > > > > > > >> Basically > > > > > > >> > >> you start the coaster service with "-shared > > > > > " and > > > > > > >> > >> all the settings are read from that file. In this case, > all > > > > > clients > > > > > > >> > >> share the same worker pool, and client settings are > ignored. > > > > > > >> > >> > > > > > > >> > >> The C client now has a multi-job testing tool which can > > > submit > > > > > many > > > > > > >> jobs > > > > > > >> > >> with the desired level of concurrency. > > > > > > >> > >> > > > > > > >> > >> I have tested the C client with both shared and > non-shared > > > mode, > > > > > with > > > > > > >> > >> various levels of jobs being sent, with either one or two > > > > > concurrent > > > > > > >> > >> clients. > > > > > > >> > >> > > > > > > >> > >> I haven't tested manual workers. > > > > > > >> > >> > > > > > > >> > >> I've also decided that during normal operation (i.e. 
> client > > > > > connects, > > > > > > >> > >> submits jobs, shuts down gracefully), there should be no > > > > > exceptions > > > > > > >> in > > > > > > >> > >> the coaster log. I think we should stick to that > principle. > > > This > > > > > was > > > > > > >> the > > > > > > >> > >> case last I tested, and we should consider any deviation > from > > > > > that > > > > > > >> to be > > > > > > >> > >> a problem. Of course, there are some things for which > there > > > is no > > > > > > >> > >> graceful shut down, such as ctrl+C-ing a manual worker. > > > > > Exceptions > > > > > > >> are > > > > > > >> > >> fine in that case. > > > > > > >> > >> > > > > > > >> > >> So anyway, let's start from here. > > > > > > >> > >> > > > > > > >> > >> Mihael > > > > > > >> > >> > > > > > > >> > >> On Fri, 2014-09-05 at 13:09 -0500, Tim Armstrong wrote: > > > > > > >> > >> > Thanks, let me know if there's anything I can help do. > > > > > > >> > >> > > > > > > > >> > >> > - Tim > > > > > > >> > >> > > > > > > > >> > >> > > > > > > > >> > >> > On Fri, Sep 5, 2014 at 12:57 PM, Mihael Hategan < > > > > > > >> hategan at mcs.anl.gov> > > > > > > >> > >> wrote: > > > > > > >> > >> > > > > > > > >> > >> > > Thanks. It also seems that there is an older bug in > > > there in > > > > > > >> which the > > > > > > >> > >> > > client connection is not properly accounted for and > > > things > > > > > start > > > > > > >> > >> failing > > > > > > >> > >> > > two minutes after the client connects (which is also > > > probably > > > > > > >> why you > > > > > > >> > >> > > didn't see this in runs with many short client > > > connections). > > > > > I'm > > > > > > >> not > > > > > > >> > >> > > sure why the fix for that bug isn't in the trunk > code. > > > > > > >> > >> > > > > > > > > >> > >> > > In any event, I'll set up a client submission loop > and > > > fix > > > > > all > > > > > > >> these > > > > > > >> > >> > > things. > > > > > > >> > >> > > > > > > > > >> > >> > > Mihael > > > > > > >> > >> > > > > > > > > >> > >> > > On Fri, 2014-09-05 at 12:13 -0500, Tim Armstrong > wrote: > > > > > > >> > >> > > > Ok, here it is with the additional debug messages. > > > Source > > > > > code > > > > > > >> > >> change is > > > > > > >> > >> > > > in commit 890c41f2ba701b10264553471590096d6f94c278. > > > > > > >> > >> > > > > > > > > > >> > >> > > > Warning: the tarball will expand to several > gigabytes > > > of > > > > > logs. > > > > > > >> > >> > > > > > > > > > >> > >> > > > I had to do multiple client runs to trigger it. It > > > seems > > > > > like > > > > > > >> the > > > > > > >> > >> > > problem > > > > > > >> > >> > > > might be triggered by abnormal termination of the > > > client. > > > > > > >> First 18 > > > > > > >> > >> runs > > > > > > >> > >> > > > went fine, problem only started when I ctrl-c-ed > the > > > > > swift/t > > > > > > >> run #19 > > > > > > >> > >> > > before > > > > > > >> > >> > > > the run #20 that exhibited delays. 
> > > > > > >> > >> > > > > > > > > > >> > >> > > > > > > > > http://people.cs.uchicago.edu/~tga/files/worker-logs3.tar.gz > > > > > > >> > >> > > > > > > > > > >> > >> > > > - Tim > > > > > > >> > >> > > > > > > > > > >> > >> > > > > > > > > > >> > >> > > > On Fri, Sep 5, 2014 at 8:55 AM, Tim Armstrong < > > > > > > >> > >> tim.g.armstrong at gmail.com > > > > > > >> > >> > > > > > > > > > >> > >> > > > wrote: > > > > > > >> > >> > > > > > > > > > >> > >> > > > > It's here: > > > > > > >> > >> > > > > > > > > > > >> > >> > > > > > > http://people.cs.uchicago.edu/~tga/files/coaster-service.out.full.gz > > > > > > >> . > > > > > > >> > >> > > > > > > > > > > >> > >> > > > > I'll add some extra debug messages in the > coaster C++ > > > > > client > > > > > > >> and > > > > > > >> > >> see > > > > > > >> > >> > > if I > > > > > > >> > >> > > > > can recreate the scenario. > > > > > > >> > >> > > > > > > > > > > >> > >> > > > > - Tim > > > > > > >> > >> > > > > > > > > > > >> > >> > > > > > > > > > > >> > >> > > > > On Thu, Sep 4, 2014 at 7:27 PM, Mihael Hategan < > > > > > > >> > >> hategan at mcs.anl.gov> > > > > > > >> > >> > > > > wrote: > > > > > > >> > >> > > > > > > > > > > >> > >> > > > >> Ok, so that's legit. > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> It does look like shut down workers are not > properly > > > > > > >> accounted > > > > > > >> > >> for in > > > > > > >> > >> > > > >> some places (and I believe Yadu submitted a bug > for > > > > > this). > > > > > > >> > >> However, I > > > > > > >> > >> > > do > > > > > > >> > >> > > > >> not see the dead time you mention in either of > the > > > last > > > > > two > > > > > > >> sets > > > > > > >> > >> of > > > > > > >> > >> > > > >> logs. It looks like each client instance > submits a > > > > > continous > > > > > > >> > >> stream of > > > > > > >> > >> > > > >> jobs. > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> So let's get back to the initial log. Can I > have the > > > > > full > > > > > > >> > >> service log? > > > > > > >> > >> > > > >> I'm trying to track what happened with the jobs > > > > > submitted > > > > > > >> before > > > > > > >> > >> the > > > > > > >> > >> > > > >> first big pause. > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> Also, a log message in > > > CoasterClient::updateJobStatus() > > > > > (or > > > > > > >> > >> friends) > > > > > > >> > >> > > > >> would probably help a lot here. > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> Mihael > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> On Thu, 2014-09-04 at 15:34 -0500, Tim Armstrong > > > wrote: > > > > > > >> > >> > > > >> > Should be here: > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > http://people.cs.uchicago.edu/~tga/worker-logs2.tar.gz > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > On Thu, Sep 4, 2014 at 3:03 PM, Mihael > Hategan < > > > > > > >> > >> hategan at mcs.anl.gov > > > > > > >> > >> > > > > > > > > > >> > >> > > > >> wrote: > > > > > > >> > >> > > > >> > > > > > > > >> > >> > > > >> > > The first worker "failing" is 0904-20022331. > > > The log > > > > > > >> looks > > > > > > >> > >> funny > > > > > > >> > >> > > at > > > > > > >> > >> > > > >> the > > > > > > >> > >> > > > >> > > end. > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > Can you git pull and re-run? 
The worker is > > > getting > > > > > some > > > > > > >> > >> command > > > > > > >> > >> > > at the > > > > > > >> > >> > > > >> > > end there and doing nothing about it and I > > > wonder > > > > > why. > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > Mihael > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > On Thu, 2014-09-04 at 14:35 -0500, Tim > Armstrong > > > > > wrote: > > > > > > >> > >> > > > >> > > > Ok, now I have some worker logs: > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > http://people.cs.uchicago.edu/~tga/2014-9-4-worker-logs.tar.gz > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > There's nothing obvious I see in the > worker > > > logs > > > > > that > > > > > > >> would > > > > > > >> > >> > > > >> indicate why > > > > > > >> > >> > > > >> > > > the connection was broken. > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > - Tim > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > On Thu, Sep 4, 2014 at 1:11 PM, Tim > Armstrong > > > < > > > > > > >> > >> > > > >> tim.g.armstrong at gmail.com > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > wrote: > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > > This is all running locally on my > laptop, > > > so I > > > > > > >> think we > > > > > > >> > >> can > > > > > > >> > >> > > rule > > > > > > >> > >> > > > >> out > > > > > > >> > >> > > > >> > > 1). > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > > It also seems like it's a state the > coaster > > > > > service > > > > > > >> gets > > > > > > >> > >> into > > > > > > >> > >> > > > >> after a > > > > > > >> > >> > > > >> > > few > > > > > > >> > >> > > > >> > > > > client sessions: generally the first > > > coaster run > > > > > > >> works > > > > > > >> > >> fine, > > > > > > >> > >> > > then > > > > > > >> > >> > > > >> > > after a > > > > > > >> > >> > > > >> > > > > few runs the problem occurs more > frequently. > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > > I'm going to try and get worker logs, > in the > > > > > > >> meantime > > > > > > >> > >> i've got > > > > > > >> > >> > > > >> some > > > > > > >> > >> > > > >> > > > > jstacks (attached). > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > > Matching service logs (largish) are > here if > > > > > needed: > > > > > > >> > >> > > > >> > > > > > > > > > http://people.cs.uchicago.edu/~tga/service.out.gz > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > > On Wed, Sep 3, 2014 at 10:35 PM, Mihael > > > Hategan > > > > > < > > > > > > >> > >> > > > >> hategan at mcs.anl.gov> > > > > > > >> > >> > > > >> > > > > wrote: > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > >> Ah, makes sense. > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> 2 minutes is the channel timeout. Each > live > > > > > > >> connection > > > > > > >> > >> is > > > > > > >> > >> > > > >> guaranteed > > > > > > >> > >> > > > >> > > to > > > > > > >> > >> > > > >> > > > >> have some communication for any 2 > minute > > > time > > > > > > >> window, > > > > > > >> > >> > > partially > > > > > > >> > >> > > > >> due to > > > > > > >> > >> > > > >> > > > >> periodic heartbeats (sent every 1 > minute). 
> > > If > > > > > no > > > > > > >> > >> packets flow > > > > > > >> > >> > > > >> for the > > > > > > >> > >> > > > >> > > > >> duration of 2 minutes, the connection > is > > > > > assumed > > > > > > >> broken > > > > > > >> > >> and > > > > > > >> > >> > > all > > > > > > >> > >> > > > >> jobs > > > > > > >> > >> > > > >> > > > >> that were submitted to the respective > > > workers > > > > > are > > > > > > >> > >> considered > > > > > > >> > >> > > > >> failed. > > > > > > >> > >> > > > >> > > So > > > > > > >> > >> > > > >> > > > >> there seems to be an issue with the > > > > > connections to > > > > > > >> some > > > > > > >> > >> of > > > > > > >> > >> > > the > > > > > > >> > >> > > > >> > > workers, > > > > > > >> > >> > > > >> > > > >> and it takes 2 minutes to detect them. > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> Since the service seems to be alive > > > (although a > > > > > > >> jstack > > > > > > >> > >> on the > > > > > > >> > >> > > > >> service > > > > > > >> > >> > > > >> > > > >> when thing seem to hang might help), > this > > > > > leaves > > > > > > >> two > > > > > > >> > >> > > > >> possibilities: > > > > > > >> > >> > > > >> > > > >> 1 - some genuine network problem > > > > > > >> > >> > > > >> > > > >> 2 - the worker died without properly > > > closing > > > > > TCP > > > > > > >> > >> connections > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> If (2), you could enable worker logging > > > > > > >> > >> > > > >> > > > >> (Settings::Key::WORKER_LOGGING_LEVEL = > > > > > "DEBUG") to > > > > > > >> see > > > > > > >> > >> if > > > > > > >> > >> > > > >> anything > > > > > > >> > >> > > > >> > > shows > > > > > > >> > >> > > > >> > > > >> up. > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> Mihael > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> On Wed, 2014-09-03 at 20:26 -0500, Tim > > > > > Armstrong > > > > > > >> wrote: > > > > > > >> > >> > > > >> > > > >> > Here are client and service logs, > with > > > part > > > > > of > > > > > > >> > >> service log > > > > > > >> > >> > > > >> edited > > > > > > >> > >> > > > >> > > down > > > > > > >> > >> > > > >> > > > >> to > > > > > > >> > >> > > > >> > > > >> > be a reasonable size (I have the full > > > thing > > > > > if > > > > > > >> > >> needed, but > > > > > > >> > >> > > it > > > > > > >> > >> > > > >> was > > > > > > >> > >> > > > >> > > over a > > > > > > >> > >> > > > >> > > > >> > gigabyte). > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > One relevant section is from 19:49:35 > > > > > onwards. > > > > > > >> The > > > > > > >> > >> client > > > > > > >> > >> > > > >> submits 4 > > > > > > >> > >> > > > >> > > > >> jobs > > > > > > >> > >> > > > >> > > > >> > (its limit), but they don't complete > > > until > > > > > > >> 19:51:32 > > > > > > >> > >> or so > > > > > > >> > >> > > (I > > > > > > >> > >> > > > >> can see > > > > > > >> > >> > > > >> > > > >> that > > > > > > >> > >> > > > >> > > > >> > one task completed based on > ncompleted=1 > > > in > > > > > the > > > > > > >> > >> > > check_tasks log > > > > > > >> > >> > > > >> > > > >> message). 
> > > > > > >> > >> > > > >> > > > >> > It looks like something has happened > with > > > > > broken > > > > > > >> > >> pipes and > > > > > > >> > >> > > > >> workers > > > > > > >> > >> > > > >> > > being > > > > > > >> > >> > > > >> > > > >> > lost, but I'm not sure what the > ultimate > > > > > cause of > > > > > > >> > >> that is > > > > > > >> > >> > > > >> likely to > > > > > > >> > >> > > > >> > > be. > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > - Tim > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > On Wed, Sep 3, 2014 at 6:20 PM, > Mihael > > > > > Hategan < > > > > > > >> > >> > > > >> hategan at mcs.anl.gov > > > > > > >> > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> wrote: > > > > > > >> > >> > > > >> > > > >> > > > > > > > >> > >> > > > >> > > > >> > > Hi Tim, > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > I've never seen this before with > pure > > > Java. > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > Do you have logs from these runs? > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > Mihael > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > On Wed, 2014-09-03 at 16:49 -0500, > Tim > > > > > > >> Armstrong > > > > > > >> > >> wrote: > > > > > > >> > >> > > > >> > > > >> > > > I'm running a test Swift/T script > > > that > > > > > submit > > > > > > >> > >> tasks to > > > > > > >> > >> > > > >> Coasters > > > > > > >> > >> > > > >> > > > >> through > > > > > > >> > >> > > > >> > > > >> > > the > > > > > > >> > >> > > > >> > > > >> > > > C++ client and I'm seeing some > odd > > > > > behaviour > > > > > > >> > >> where task > > > > > > >> > >> > > > >> > > > >> > > > submission/execution is stalling > for > > > ~2 > > > > > > >> minute > > > > > > >> > >> periods. > > > > > > >> > >> > > > >> For > > > > > > >> > >> > > > >> > > > >> example, I'm > > > > > > >> > >> > > > >> > > > >> > > > seeing submit log messages like > > > > > "submitting > > > > > > >> > >> > > > >> > > > >> > > > > urn:133-1409778135377-1409778135378: > > > > > > >> > >> /bin/hostname" in > > > > > > >> > >> > > > >> bursts of > > > > > > >> > >> > > > >> > > > >> several > > > > > > >> > >> > > > >> > > > >> > > > seconds with a gap of roughly 2 > > > minutes > > > > > in > > > > > > >> > >> between, > > > > > > >> > >> > > e.g. > > > > > > >> > >> > > > >> I'm > > > > > > >> > >> > > > >> > > seeing > > > > > > >> > >> > > > >> > > > >> > > bursts > > > > > > >> > >> > > > >> > > > >> > > > with the following intervals in > my > > > logs. 
> > > > > > >> > >> > > > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> > > > 16:07:04,603 to 16:07:10,391 > > > > > > >> > >> > > > >> > > > >> > > > 16:09:07,377 to 16:09:13,076 > > > > > > >> > >> > > > >> > > > >> > > > 16:11:10,005 to 16:11:16,770 > > > > > > >> > >> > > > >> > > > >> > > > 16:13:13,291 to 16:13:19,296 > > > > > > >> > >> > > > >> > > > >> > > > 16:15:16,000 to 16:15:21,602 > > > > > > >> > >> > > > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> > > > From what I can tell, the delay > is > > > on the > > > > > > >> coaster > > > > > > >> > >> > > service > > > > > > >> > >> > > > >> side: > > > > > > >> > >> > > > >> > > the > > > > > > >> > >> > > > >> > > > >> C > > > > > > >> > >> > > > >> > > > >> > > > client is just waiting for a > > > response. > > > > > > >> > >> > > > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> > > > The jobs are just being submitted > > > > > through the > > > > > > >> > >> local job > > > > > > >> > >> > > > >> > > manager, so > > > > > > >> > >> > > > >> > > > >> I > > > > > > >> > >> > > > >> > > > >> > > > wouldn't expect any delays there. > > > The > > > > > tasks > > > > > > >> are > > > > > > >> > >> also > > > > > > >> > >> > > just > > > > > > >> > >> > > > >> > > > >> > > "/bin/hostname", > > > > > > >> > >> > > > >> > > > >> > > > so should return immediately. > > > > > > >> > >> > > > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> > > > I'm going to continue digging > into > > > this > > > > > on my > > > > > > >> > >> own, but > > > > > > >> > >> > > the > > > > > > >> > >> > > > >> 2 > > > > > > >> > >> > > > >> > > minute > > > > > > >> > >> > > > >> > > > >> delay > > > > > > >> > >> > > > >> > > > >> > > > seems like a big clue: does > anyone > > > have > > > > > an > > > > > > >> idea > > > > > > >> > >> what > > > > > > >> > >> > > could > > > > > > >> > >> > > > >> cause > > > > > > >> > >> > > > >> > > > >> stalls > > > > > > >> > >> > > > >> > > > >> > > in > > > > > > >> > >> > > > >> > > > >> > > > task submission of 2 minute > duration? > > > > > > >> > >> > > > >> > > > >> > > > > > > > > > >> > >> > > > >> > > > >> > > > Cheers, > > > > > > >> > >> > > > >> > > > >> > > > Tim > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > >> > > > > > > >> > >> > > > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Thu Sep 11 15:34:48 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Thu, 11 Sep 2014 13:34:48 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> Message-ID: <1410467688.638.0.camel@echo> On Thu, 2014-09-11 at 15:30 -0500, Tim Armstrong wrote: > I'll give automatic workers a try. If i > That would make sense since the passive workers aren't per-client. Right. That was my thought, too. I'll get this fixed soon-ish. Mihael From tim.g.armstrong at gmail.com Fri Sep 12 17:14:03 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Fri, 12 Sep 2014 17:14:03 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410467688.638.0.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> Message-ID: My initial test with active workers seems to be running fine. The passive worker test is failing - I'll rerun that once it's fixed. - Tim On Thu, Sep 11, 2014 at 3:34 PM, Mihael Hategan wrote: > On Thu, 2014-09-11 at 15:30 -0500, Tim Armstrong wrote: > > I'll give automatic workers a try. If i > > That would make sense since the passive workers aren't per-client. > > Right. That was my thought, too. I'll get this fixed soon-ish. > > Mihael > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Sat Sep 13 19:49:39 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Sat, 13 Sep 2014 17:49:39 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> Message-ID: <1410655779.22697.1.camel@echo> I didn't send an email about it, but I fixed it thursday evening. Or at least I hope I did. Mihael On Fri, 2014-09-12 at 17:14 -0500, Tim Armstrong wrote: > My initial test with active workers seems to be running fine. The passive > worker test is failing - I'll rerun that once it's fixed. > > - Tim > > On Thu, Sep 11, 2014 at 3:34 PM, Mihael Hategan wrote: > > > On Thu, 2014-09-11 at 15:30 -0500, Tim Armstrong wrote: > > > I'll give automatic workers a try. If i > > > That would make sense since the passive workers aren't per-client. > > > > Right. That was my thought, too. I'll get this fixed soon-ish. 
> > > > Mihael > > > > > > From tim.g.armstrong at gmail.com Mon Sep 15 12:52:45 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Mon, 15 Sep 2014 12:52:45 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410655779.22697.1.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> Message-ID: So I was trying to get it to run local jobs via a passive service. The jobs just seem to be accumulating in the service's queue and not being run. Maybe I'm using the wrong job manager - it's being left as NULL, which is converted to fork. - TIm On Sat, Sep 13, 2014 at 7:49 PM, Mihael Hategan wrote: > I didn't send an email about it, but I fixed it thursday evening. Or at > least I hope I did. > > Mihael > > On Fri, 2014-09-12 at 17:14 -0500, Tim Armstrong wrote: > > My initial test with active workers seems to be running fine. The > passive > > worker test is failing - I'll rerun that once it's fixed. > > > > - Tim > > > > On Thu, Sep 11, 2014 at 3:34 PM, Mihael Hategan > wrote: > > > > > On Thu, 2014-09-11 at 15:30 -0500, Tim Armstrong wrote: > > > > I'll give automatic workers a try. If i > > > > That would make sense since the passive workers aren't per-client. > > > > > > Right. That was my thought, too. I'll get this fixed soon-ish. > > > > > > Mihael > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: passive-no-run.tar.gz Type: application/x-gzip Size: 82317 bytes Desc: not available URL: From iraicu at cs.iit.edu Mon Sep 15 16:12:05 2014 From: iraicu at cs.iit.edu (Ioan Raicu) Date: Mon, 15 Sep 2014 16:12:05 -0500 Subject: [Swift-devel] CFP: IEEE/ACM Int. Symposium on Big Data Computing (BDC) 2014 -- 1 week deadline extension Message-ID: <54175625.7080008@cs.iit.edu> Call for Papers IEEE/ACM International Symposium on Big Data Computing (BDC) 2014 December 8-11, 2014, London, UK http://www.cloudbus.org/bdc2014 In conjunction with: 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2014) Sponsored by: IEEE Computer Society and ACM (Association for Computing Machinery) Introduction =============================================================================== Rapid advances in digital sensors, networks, storage, and computation along with their availability at low cost is leading to the creation of huge collections of data -- dubbed as Big Data. This data has the potential for enabling new insights that can change the way business, science, and governments deliver services to their consumers and can impact society as a whole. This has led to the emergence of the Big Data Computing paradigm focusing on sensing, collection, storage, management and analysis of data from variety of sources to enable new value and insights. To realize the full potential of Big Data Computing, we need to address several challenges and develop suitable conceptual and technological solutions for dealing them. 
These include life-cycle management of data, large-scale storage, flexible processing infrastructure, data modelling, scalable machine learning and data analysis algorithms, techniques for sampling and making trade-off between data processing time and accuracy, and dealing with privacy and ethical issues involved in data sensing, storage, processing, and actions. The IEEE/ACM International Symposium on Big Data Computing (BDC) 2014 -- held in conjunction with 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2014), December 8-11, 2014, London, UK, aims at bringing together international researchers, developers, policy makers, and users and to provide an international forum to present leading research activities, technical solutions, and results on a broad range of topics related to Big Data Computing paradigms, platforms and their applications. The conference features keynotes, technical presentations, posters, workshops, tutorials, as well as competitions featuring live demonstrations. Topics =============================================================================== Topics of interest include, but are not limited to: I. Big Data Science * Analytics * Algorithms for Big Data * Energy-efficient Algorithms * Big Data Search * Big Data Acquisition, Integration, Cleaning, and Best Practices * Visualization of Big Data II. Big Data Infrastructures and Platforms * Programming Systems * Cyber-Infrastructure * Performance evaluation * Fault tolerance and reliability * I/O and Data management * Storage Systems (including file systems, NoSQL, and RDBMS) * Resource management * Many-Task Computing * Many-core computing and accelerators III. Big Data Security and Policy * Management Policies * Data Privacy * Data Security * Big Data Archival and Preservation * Big Data Provenance IV. Big Data Applications * Scientific application cases studies on Cloud infrastructure * Big Data Applications at Scale * Experience Papers with Big Data Application Deployments * Data streaming applications * Big Data in Social Networks * Healthcare Applications * Enterprise Applications IMPORTANT DATES =============================================================================== * Abstracts Due: September 15th, 2014 * Papers Due: September 22nd, 2014 * Notification of Acceptance: October 15th, 2014 * Camera Ready Papers Due: October 31st, 2014 Note: Those who submit an abstract by the deadline will be given 1 week to upload the final paper. PAPER SUBMISSION =============================================================================== Authors are invited to submit papers electronically. Submitted manuscripts should be structured as technical papers and may not exceed 10 letter size (8.5 x 11) pages including figures, tables and references using the IEEE format for conference proceedings (print area of 6-1/2 inches (16.51 cm) wide by 8-7/8 inches (22.51 cm) high, two-column format with columns 3-1/16 inches (7.85 cm) wide with a 3/8 inch (0.81 cm) space between them, single-spaced 10-point Times fully justified text). Submissions not conforming to these guidelines may be returned without review. Authors should submit the manuscript in PDF format and make sure that the file will print on a printer that uses letter size (8.5 x 11) paper. The official language of the meeting is English. All manuscripts will be reviewed and will be judged on correctness, originality, technical strength, significance, quality of presentation, and interest and relevance to the conference attendees. 
Papers conforming to the above guidelines can be submitted through the BDC 2014 paper submission system (https://www.easychair.org/conferences/?conf=bdc2014). Submitted papers must represent original unpublished research that is not currently under review for any other conference or journal. Papers not following these guidelines will be rejected without review and further action may be taken, including (but not limited to) notifications sent to the heads of the institutions of the authors and sponsors of the conference. Submissions received after the due date, exceeding length limit, or not appropriately structured may also not be considered. Authors may contact the conference PC Chair for more information. The proceedings will be published through the IEEE Computer Society Press, USA and will be made online through the IEEE Digital Library. Selected papers from BDC 2014 will be invited to extend and submit to the Special Issue on Many-Task Computing in the Cloud in the IEEE Transaction on Cloud Computing (http://datasys.cs.iit.edu/events/TCC-MTC15/CFP_TCC-MTC15.pdf) CHAIRS & COMMITTEES =============================================================================== General Co-Chairs: * Rajkumar Buyya, University of Melbourne, Australia * Divyakant Agrawal, University of California at Santa Barbara, USA Program Co-Chairs: * Ioan Raicu, Illinois Institute of Technology and Argonne National Lab., USA * Manish Parashar, Rutgers, The State University of New Jersey, USA Area Track Co-Chairs: * Big Data Science o Omer F. Rana, Cardiff University, UK o Ilkay Altintas, University of California, San Diego, USA * Big Data Infrastructures and Platforms o Amy Apon, Clemson University, USA o Jiannong Cao, Honk Kong Polytechnic University * Big Data Security and Policy o Bogdan Carbunar, Florida International University * Big Data Applications o Dennis Gannon, Microsoft Research, USA Cyber Chair * Amir Vahid, University of Melbourne, Australia Publicity Chairs * Carlos Westphall, Federal University of Santa Catarina, Brazil * Ching-Hsien Hsu, Chung Hua Univ., Taiwan & Tianjin Univ. of Technology, China * Rong Ge, Marquette University, USA * Giuliano Casale, Imperial College London, UK Organizing Chair: * Ashiq Anjum, University of Derby, UK -- ================================================================= Ioan Raicu, Ph.D. 
Assistant Professor, Illinois Institute of Technology (IIT) Guest Research Faculty, Argonne National Laboratory (ANL) ================================================================= Data-Intensive Distributed Systems Laboratory, CS/IIT Distributed Systems Laboratory, MCS/ANL ================================================================= Editor: IEEE TCC, Springer Cluster, Springer JoCCASA Chair: IEEE/ACM MTAGS, ACM ScienceCloud ================================================================= Cel: 1-847-722-0876 Office: 1-312-567-5704 Email: iraicu at cs.iit.edu Web: http://www.cs.iit.edu/~iraicu/ Web: http://datasys.cs.iit.edu/ LinkedIn: http://www.linkedin.com/in/ioanraicu Google: http://scholar.google.com/citations?user=jE73HYAAAAAJ ================================================================= ================================================================= From hategan at mcs.anl.gov Tue Sep 16 03:50:46 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Sep 2014 01:50:46 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> Message-ID: <1410857446.27823.1.camel@echo> On Mon, 2014-09-15 at 12:52 -0500, Tim Armstrong wrote: > So I was trying to get it to run local jobs via a passive service. The > jobs just seem to be accumulating in the service's queue and not being run. > > Maybe I'm using the wrong job manager - it's being left as NULL, which is > converted to fork. I can see how that would happen. I will fix it. In the mean time, I believe that setting provider to "local" might convince it to route the jobs through the proper queue. Mihael From tim.g.armstrong at gmail.com Tue Sep 16 09:52:27 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 16 Sep 2014 09:52:27 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410857446.27823.1.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> Message-ID: Would the "local" setting be in the -shared config file? On Tue, Sep 16, 2014 at 3:50 AM, Mihael Hategan wrote: > On Mon, 2014-09-15 at 12:52 -0500, Tim Armstrong wrote: > > So I was trying to get it to run local jobs via a passive service. The > > jobs just seem to be accumulating in the service's queue and not being > run. > > > > Maybe I'm using the wrong job manager - it's being left as NULL, which is > > converted to fork. > > I can see how that would happen. I will fix it. In the mean time, I > believe that setting provider to "local" might convince it to route the > jobs through the proper queue. > > Mihael > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hategan at mcs.anl.gov Tue Sep 16 13:32:06 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Sep 2014 11:32:06 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> Message-ID: <1410892326.29235.1.camel@echo> On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > Would the "local" setting be in the -shared config file? Shared and passive should be mutually exclusive, although I don't think the code enforces that. I'll make sure it would. But hang on. The whole thing makes no sense. So let me get back to you on it. Mihael From hategan at mcs.anl.gov Tue Sep 16 15:02:19 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Sep 2014 13:02:19 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410892326.29235.1.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> Message-ID: <1410897739.30937.1.camel@echo> I take it all back. It looks like submitting jobs to a passive service with empty settings should work. What I do not see in your log is any workers being actually started. How are you starting the workers? Mihael On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > Would the "local" setting be in the -shared config file? > > Shared and passive should be mutually exclusive, although I don't think > the code enforces that. I'll make sure it would. > > But hang on. The whole thing makes no sense. So let me get back to you > on it. > > Mihael > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From hategan at mcs.anl.gov Tue Sep 16 15:09:49 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Sep 2014 13:09:49 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410897739.30937.1.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> <1410897739.30937.1.camel@echo> Message-ID: <1410898189.30937.6.camel@echo> Ok, I see this in coaster-start-service.log: Running /home/tim/ExM/swift-k.git/dist/swift-svn/bin/worker.pl http://127.0.0.1:60566 LOCAL Not given: LOGDIR Looks like that script hasn't been updated in a while. 
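The log excerpt above shows worker.pl being launched with just a service URL and a block id, and the "Not given: LOGDIR" line is worker.pl complaining about a missing log directory argument. When starting a worker by hand, the log directory normally goes on the command line as well. A minimal sketch, with the argument order inferred from the excerpt and the error message, so treat it as an assumption rather than the documented interface:

    mkdir -p /tmp/worker-logs
    perl worker.pl http://127.0.0.1:60566 LOCAL /tmp/worker-logs

With no usable log directory the worker exits right away, which would also explain jobs piling up in the service queue with no workers ever registering.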
Things probably worked for you before because it wasn't running jobs through workers. I'll try to see what's happening with that. However, and correct me if I'm wrong, I don't see much benefit in this particular case for managing workers through a shell script rather than letting the service do the work. So as far as C client testing goes you could just use -shared. You can, of course, start worker.pl manually until this gets sorted out. Mihael On Tue, 2014-09-16 at 13:02 -0700, Mihael Hategan wrote: > I take it all back. > > It looks like submitting jobs to a passive service with empty settings > should work. > > What I do not see in your log is any workers being actually started. How > are you starting the workers? > > Mihael > > On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > > Would the "local" setting be in the -shared config file? > > > > Shared and passive should be mutually exclusive, although I don't think > > the code enforces that. I'll make sure it would. > > > > But hang on. The whole thing makes no sense. So let me get back to you > > on it. > > > > Mihael > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From tim.g.armstrong at gmail.com Tue Sep 16 16:17:26 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 16 Sep 2014 16:17:26 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410898189.30937.6.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> <1410897739.30937.1.camel@echo> <1410898189.30937.6.camel@echo> Message-ID: Right, the other mode of running the service probably makes the most sense, I have this in the test suite though since it's the most straightforward way to test the C client running in a passive coaster configuration. On Tue, Sep 16, 2014 at 3:09 PM, Mihael Hategan wrote: > Ok, I see this in coaster-start-service.log: > > Running /home/tim/ExM/swift-k.git/dist/swift-svn/bin/worker.pl > http://127.0.0.1:60566 LOCAL > Not given: LOGDIR > > Looks like that script hasn't been updated in a while. Things probably > worked for you before because it wasn't running jobs through workers. > > I'll try to see what's happening with that. > > However, and correct me if I'm wrong, I don't see much benefit in this > particular case for managing workers through a shell script rather than > letting the service do the work. So as far as C client testing goes you > could just use -shared. > > You can, of course, start worker.pl manually until this gets sorted out. > > Mihael > > On Tue, 2014-09-16 at 13:02 -0700, Mihael Hategan wrote: > > I take it all back. > > > > It looks like submitting jobs to a passive service with empty settings > > should work. > > > > What I do not see in your log is any workers being actually started. 
How > > are you starting the workers? > > > > Mihael > > > > On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > > > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > > > Would the "local" setting be in the -shared config file? > > > > > > Shared and passive should be mutually exclusive, although I don't think > > > the code enforces that. I'll make sure it would. > > > > > > But hang on. The whole thing makes no sense. So let me get back to you > > > on it. > > > > > > Mihael > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Sep 16 16:34:54 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Sep 2014 14:34:54 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> <1410897739.30937.1.camel@echo> <1410898189.30937.6.camel@echo> Message-ID: <1410903294.32438.0.camel@echo> Do you happen to have an empty/missing WORKER_LOG_DIR in the service config file? Mihael On Tue, 2014-09-16 at 16:17 -0500, Tim Armstrong wrote: > Right, the other mode of running the service probably makes the most sense, > I have this in the test suite though since it's the most straightforward > way to test the C client running in a passive coaster configuration. > > > > On Tue, Sep 16, 2014 at 3:09 PM, Mihael Hategan wrote: > > > Ok, I see this in coaster-start-service.log: > > > > Running /home/tim/ExM/swift-k.git/dist/swift-svn/bin/worker.pl > > http://127.0.0.1:60566 LOCAL > > Not given: LOGDIR > > > > Looks like that script hasn't been updated in a while. Things probably > > worked for you before because it wasn't running jobs through workers. > > > > I'll try to see what's happening with that. > > > > However, and correct me if I'm wrong, I don't see much benefit in this > > particular case for managing workers through a shell script rather than > > letting the service do the work. So as far as C client testing goes you > > could just use -shared. > > > > You can, of course, start worker.pl manually until this gets sorted out. > > > > Mihael > > > > On Tue, 2014-09-16 at 13:02 -0700, Mihael Hategan wrote: > > > I take it all back. > > > > > > It looks like submitting jobs to a passive service with empty settings > > > should work. > > > > > > What I do not see in your log is any workers being actually started. How > > > are you starting the workers? > > > > > > Mihael > > > > > > On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > > > > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > > > > Would the "local" setting be in the -shared config file? 
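For reference, the WORKER_LOG_DIR discussed in this exchange is a key in the configuration file that start-coaster-service reads. A minimal sketch of the relevant lines, assuming the file is sourced as shell the way the stock coaster-service.conf examples are; only WORKER_LOG_DIR is named in the thread, the WORKER_LOGGING_LEVEL key is an assumption:

    # coaster-service.conf (illustrative fragment, not a complete config)
    export WORKER_LOGGING_LEVEL=DEBUG        # assumed key controlling worker.pl verbosity
    export WORKER_LOG_DIR=/tmp/worker-logs   # must be set and writable, or worker.pl refuses to start

As comes out later in the thread, start-coaster-service neither enforces this setting nor reports workers that fail to start because of it, so a missing value is easy to overlook.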
> > > > > > > > Shared and passive should be mutually exclusive, although I don't think > > > > the code enforces that. I'll make sure it would. > > > > > > > > But hang on. The whole thing makes no sense. So let me get back to you > > > > on it. > > > > > > > > Mihael > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > From tim.g.armstrong at gmail.com Tue Sep 16 16:43:13 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 16 Sep 2014 16:43:13 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410903294.32438.0.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> <1410897739.30937.1.camel@echo> <1410898189.30937.6.camel@echo> <1410903294.32438.0.camel@echo> Message-ID: Yes, it's not set anywhere - should I be setting it to something? On Tue, Sep 16, 2014 at 4:34 PM, Mihael Hategan wrote: > Do you happen to have an empty/missing WORKER_LOG_DIR in the service > config file? > > Mihael > > On Tue, 2014-09-16 at 16:17 -0500, Tim Armstrong wrote: > > Right, the other mode of running the service probably makes the most > sense, > > I have this in the test suite though since it's the most straightforward > > way to test the C client running in a passive coaster configuration. > > > > > > > > On Tue, Sep 16, 2014 at 3:09 PM, Mihael Hategan > wrote: > > > > > Ok, I see this in coaster-start-service.log: > > > > > > Running /home/tim/ExM/swift-k.git/dist/swift-svn/bin/worker.pl > > > http://127.0.0.1:60566 LOCAL > > > Not given: LOGDIR > > > > > > Looks like that script hasn't been updated in a while. Things probably > > > worked for you before because it wasn't running jobs through workers. > > > > > > I'll try to see what's happening with that. > > > > > > However, and correct me if I'm wrong, I don't see much benefit in this > > > particular case for managing workers through a shell script rather than > > > letting the service do the work. So as far as C client testing goes you > > > could just use -shared. > > > > > > You can, of course, start worker.pl manually until this gets sorted > out. > > > > > > Mihael > > > > > > On Tue, 2014-09-16 at 13:02 -0700, Mihael Hategan wrote: > > > > I take it all back. > > > > > > > > It looks like submitting jobs to a passive service with empty > settings > > > > should work. > > > > > > > > What I do not see in your log is any workers being actually started. > How > > > > are you starting the workers? > > > > > > > > Mihael > > > > > > > > On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > > > > > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > > > > > Would the "local" setting be in the -shared config file? 
> > > > > > > > > > Shared and passive should be mutually exclusive, although I don't > think > > > > > the code enforces that. I'll make sure it would. > > > > > > > > > > But hang on. The whole thing makes no sense. So let me get back to > you > > > > > on it. > > > > > > > > > > Mihael > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Sep 16 16:47:20 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 16 Sep 2014 14:47:20 -0700 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> <1410897739.30937.1.camel@echo> <1410898189.30937.6.camel@echo> <1410903294.32438.0.camel@echo> Message-ID: <1410904040.32576.0.camel@echo> Yes. That's what seems to be causing the problem. worker.pl requires it, but start-coaster-service doesn't enforce it nor does it report when workers fail to start. Mihael On Tue, 2014-09-16 at 16:43 -0500, Tim Armstrong wrote: > Yes, it's not set anywhere - should I be setting it to something? > > On Tue, Sep 16, 2014 at 4:34 PM, Mihael Hategan wrote: > > > Do you happen to have an empty/missing WORKER_LOG_DIR in the service > > config file? > > > > Mihael > > > > On Tue, 2014-09-16 at 16:17 -0500, Tim Armstrong wrote: > > > Right, the other mode of running the service probably makes the most > > sense, > > > I have this in the test suite though since it's the most straightforward > > > way to test the C client running in a passive coaster configuration. > > > > > > > > > > > > On Tue, Sep 16, 2014 at 3:09 PM, Mihael Hategan > > wrote: > > > > > > > Ok, I see this in coaster-start-service.log: > > > > > > > > Running /home/tim/ExM/swift-k.git/dist/swift-svn/bin/worker.pl > > > > http://127.0.0.1:60566 LOCAL > > > > Not given: LOGDIR > > > > > > > > Looks like that script hasn't been updated in a while. Things probably > > > > worked for you before because it wasn't running jobs through workers. > > > > > > > > I'll try to see what's happening with that. > > > > > > > > However, and correct me if I'm wrong, I don't see much benefit in this > > > > particular case for managing workers through a shell script rather than > > > > letting the service do the work. So as far as C client testing goes you > > > > could just use -shared. > > > > > > > > You can, of course, start worker.pl manually until this gets sorted > > out. > > > > > > > > Mihael > > > > > > > > On Tue, 2014-09-16 at 13:02 -0700, Mihael Hategan wrote: > > > > > I take it all back. > > > > > > > > > > It looks like submitting jobs to a passive service with empty > > settings > > > > > should work. 
> > > > > > > > > > What I do not see in your log is any workers being actually started. > > How > > > > > are you starting the workers? > > > > > > > > > > Mihael > > > > > > > > > > On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > > > > > > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > > > > > > Would the "local" setting be in the -shared config file? > > > > > > > > > > > > Shared and passive should be mutually exclusive, although I don't > > think > > > > > > the code enforces that. I'll make sure it would. > > > > > > > > > > > > But hang on. The whole thing makes no sense. So let me get back to > > you > > > > > > on it. > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > _______________________________________________ > > > > > Swift-devel mailing list > > > > > Swift-devel at ci.uchicago.edu > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > From tim.g.armstrong at gmail.com Tue Sep 16 17:13:00 2014 From: tim.g.armstrong at gmail.com (Tim Armstrong) Date: Tue, 16 Sep 2014 17:13:00 -0500 Subject: [Swift-devel] Coaster Task Submission Stalling In-Reply-To: <1410904040.32576.0.camel@echo> References: <1409786446.18898.0.camel@echo> <1409801722.21132.8.camel@echo> <1409860986.7960.3.camel@echo> <1409876838.3600.8.camel@echo> <1409939838.12288.3.camel@echo> <1410205101.24345.22.camel@echo> <1410457173.25856.5.camel@echo> <1410459811.27191.5.camel@echo> <1410461068.27191.7.camel@echo> <1410461908.31274.1.camel@echo> <1410467688.638.0.camel@echo> <1410655779.22697.1.camel@echo> <1410857446.27823.1.camel@echo> <1410892326.29235.1.camel@echo> <1410897739.30937.1.camel@echo> <1410898189.30937.6.camel@echo> <1410903294.32438.0.camel@echo> <1410904040.32576.0.camel@echo> Message-ID: Thanks, seems to be working now. - Tim On Tue, Sep 16, 2014 at 4:47 PM, Mihael Hategan wrote: > Yes. That's what seems to be causing the problem. worker.pl requires it, > but start-coaster-service doesn't enforce it nor does it report when > workers fail to start. > > Mihael > > On Tue, 2014-09-16 at 16:43 -0500, Tim Armstrong wrote: > > Yes, it's not set anywhere - should I be setting it to something? > > > > On Tue, Sep 16, 2014 at 4:34 PM, Mihael Hategan > wrote: > > > > > Do you happen to have an empty/missing WORKER_LOG_DIR in the service > > > config file? > > > > > > Mihael > > > > > > On Tue, 2014-09-16 at 16:17 -0500, Tim Armstrong wrote: > > > > Right, the other mode of running the service probably makes the most > > > sense, > > > > I have this in the test suite though since it's the most > straightforward > > > > way to test the C client running in a passive coaster configuration. > > > > > > > > > > > > > > > > On Tue, Sep 16, 2014 at 3:09 PM, Mihael Hategan > > > > wrote: > > > > > > > > > Ok, I see this in coaster-start-service.log: > > > > > > > > > > Running /home/tim/ExM/swift-k.git/dist/swift-svn/bin/worker.pl > > > > > http://127.0.0.1:60566 LOCAL > > > > > Not given: LOGDIR > > > > > > > > > > Looks like that script hasn't been updated in a while. Things > probably > > > > > worked for you before because it wasn't running jobs through > workers. > > > > > > > > > > I'll try to see what's happening with that. 
> > > > > > > > > > However, and correct me if I'm wrong, I don't see much benefit in > this > > > > > particular case for managing workers through a shell script rather > than > > > > > letting the service do the work. So as far as C client testing > goes you > > > > > could just use -shared. > > > > > > > > > > You can, of course, start worker.pl manually until this gets > sorted > > > out. > > > > > > > > > > Mihael > > > > > > > > > > On Tue, 2014-09-16 at 13:02 -0700, Mihael Hategan wrote: > > > > > > I take it all back. > > > > > > > > > > > > It looks like submitting jobs to a passive service with empty > > > settings > > > > > > should work. > > > > > > > > > > > > What I do not see in your log is any workers being actually > started. > > > How > > > > > > are you starting the workers? > > > > > > > > > > > > Mihael > > > > > > > > > > > > On Tue, 2014-09-16 at 11:32 -0700, Mihael Hategan wrote: > > > > > > > On Tue, 2014-09-16 at 09:52 -0500, Tim Armstrong wrote: > > > > > > > > Would the "local" setting be in the -shared config file? > > > > > > > > > > > > > > Shared and passive should be mutually exclusive, although I > don't > > > think > > > > > > > the code enforces that. I'll make sure it would. > > > > > > > > > > > > > > But hang on. The whole thing makes no sense. So let me get > back to > > > you > > > > > > > on it. > > > > > > > > > > > > > > Mihael > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > Swift-devel mailing list > > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > Swift-devel mailing list > > > > > > Swift-devel at ci.uchicago.edu > > > > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > > > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ketan at mcs.anl.gov Fri Sep 19 13:15:36 2014 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Fri, 19 Sep 2014 13:15:36 -0500 Subject: [Swift-devel] worker logs Message-ID: Hi, A question about worker logs with Swift 0.95 automatic coasters: With these lines in sites.xml: DEBUG /tmp/workerlog I get worker logs in the said directory for local:local on my localhost. However, when trying the same for local:cobalt on BlueGene, I do not get any worker logs. Any clues as to what could be the reason for not getting worker logs on BlueGene? Compute nodes can write to workerLoggingDirectory so I do not thing it is an issue. Thanks, Ketan -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at anl.gov Fri Sep 19 13:18:32 2014 From: wilde at anl.gov (Michael Wilde) Date: Fri, 19 Sep 2014 13:18:32 -0500 Subject: [Swift-devel] worker logs In-Reply-To: References: Message-ID: <541C7378.50604@anl.gov> Perhaps an environment variable (or argument?) to turn on worker logging is not getting through the cobalt provider and/or cobalt to the worker? On 9/19/14, 1:15 PM, Ketan Maheshwari wrote: > Hi, > > A question about worker logs with Swift 0.95 automatic coasters: > > With these lines in sites.xml: > > DEBUG > key="workerLoggingDirectory">/tmp/workerlog > > I get worker logs in the said directory for local:local on my localhost. > > However, when trying the same for local:cobalt on BlueGene, I do not > get any worker logs. 
> > Any clues as to what could be the reason for not getting worker logs > on BlueGene? Compute nodes can write to workerLoggingDirectory so I do > not thing it is an issue. > > Thanks, > Ketan > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago -------------- next part -------------- An HTML attachment was scrubbed... URL: From wozniak at mcs.anl.gov Fri Sep 19 13:46:54 2014 From: wozniak at mcs.anl.gov (Justin M Wozniak) Date: Fri, 19 Sep 2014 13:46:54 -0500 Subject: [Swift-devel] worker logs In-Reply-To: <541C7378.50604@anl.gov> References: <541C7378.50604@anl.gov> Message-ID: <541C7A1E.3070008@mcs.anl.gov> Are you running workers on the BG/Q compute nodes? Are you sure the workers are starting? You may want to put some output calls in the start of worker.pl to see if they start. On 09/19/2014 01:18 PM, Michael Wilde wrote: > Perhaps an environment variable (or argument?) to turn on worker > logging is not getting through the cobalt provider and/or cobalt to > the worker? > > > On 9/19/14, 1:15 PM, Ketan Maheshwari wrote: >> Hi, >> >> A question about worker logs with Swift 0.95 automatic coasters: >> >> With these lines in sites.xml: >> >> DEBUG >> > key="workerLoggingDirectory">/tmp/workerlog >> >> I get worker logs in the said directory for local:local on my localhost. >> >> However, when trying the same for local:cobalt on BlueGene, I do not >> get any worker logs. >> >> Any clues as to what could be the reason for not getting worker logs >> on BlueGene? Compute nodes can write to workerLoggingDirectory so I >> do not thing it is an issue. >> >> Thanks, >> Ketan >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Justin M Wozniak -------------- next part -------------- An HTML attachment was scrubbed... URL: From ketan at mcs.anl.gov Fri Sep 19 13:54:06 2014 From: ketan at mcs.anl.gov (Ketan Maheshwari) Date: Fri, 19 Sep 2014 13:54:06 -0500 Subject: [Swift-devel] worker logs In-Reply-To: <541C7A1E.3070008@mcs.anl.gov> References: <541C7378.50604@anl.gov> <541C7A1E.3070008@mcs.anl.gov> Message-ID: Yes, I know the workers start because in an earlier modification in worker.pl I had a syntax error and the error message showed up on stderr. After, correcting the error, things actually run with provider coaster and mode local:cobalt so I assume the workers are running alright. With timestamped logs, I want to see if the jobs are spawned at right intervals as I intend with these mods. On Fri, Sep 19, 2014 at 1:46 PM, Justin M Wozniak wrote: > > Are you running workers on the BG/Q compute nodes? > > Are you sure the workers are starting? You may want to put some output > calls in the start of worker.pl to see if they start. > > On 09/19/2014 01:18 PM, Michael Wilde wrote: > > Perhaps an environment variable (or argument?) 
to turn on worker logging > is not getting through the cobalt provider and/or cobalt to the worker? > > > On 9/19/14, 1:15 PM, Ketan Maheshwari wrote: > > Hi, > > A question about worker logs with Swift 0.95 automatic coasters: > > With these lines in sites.xml: > > DEBUG > key="workerLoggingDirectory">/tmp/workerlog > > I get worker logs in the said directory for local:local on my localhost. > > However, when trying the same for local:cobalt on BlueGene, I do not get > any worker logs. > > Any clues as to what could be the reason for not getting worker logs on > BlueGene? Compute nodes can write to workerLoggingDirectory so I do not > thing it is an issue. > > Thanks, > Ketan > > > _______________________________________________ > Swift-devel mailing listSwift-devel at ci.uchicago.eduhttps://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > -- > Michael Wilde > Mathematics and Computer Science Computation Institute > Argonne National Laboratory The University of Chicago > > > > _______________________________________________ > Swift-devel mailing listSwift-devel at ci.uchicago.eduhttps://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > -- > Justin M Wozniak > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wilde at anl.gov Fri Sep 19 13:57:58 2014 From: wilde at anl.gov (Michael Wilde) Date: Fri, 19 Sep 2014 13:57:58 -0500 Subject: [Swift-devel] worker logs In-Reply-To: References: <541C7378.50604@anl.gov> <541C7A1E.3070008@mcs.anl.gov> Message-ID: <541C7CB6.4070901@anl.gov> You can get close-enough data from the main Swift .log file. - Mike On 9/19/14, 1:54 PM, Ketan Maheshwari wrote: > Yes, I know the workers start because in an earlier modification in > worker.pl I had a syntax error and the error > message showed up on stderr. > > After, correcting the error, things actually run with provider coaster > and mode local:cobalt so I assume the workers are running alright. > With timestamped logs, I want to see if the jobs are spawned at right > intervals as I intend with these mods. > > On Fri, Sep 19, 2014 at 1:46 PM, Justin M Wozniak > wrote: > > > Are you running workers on the BG/Q compute nodes? > > Are you sure the workers are starting? You may want to put some > output calls in the start of worker.pl to see > if they start. > > On 09/19/2014 01:18 PM, Michael Wilde wrote: >> Perhaps an environment variable (or argument?) to turn on worker >> logging is not getting through the cobalt provider and/or cobalt >> to the worker? >> >> >> On 9/19/14, 1:15 PM, Ketan Maheshwari wrote: >>> Hi, >>> >>> A question about worker logs with Swift 0.95 automatic coasters: >>> >>> With these lines in sites.xml: >>> >>> DEBUG >>> >> key="workerLoggingDirectory">/tmp/workerlog >>> >>> I get worker logs in the said directory for local:local on my >>> localhost. >>> >>> However, when trying the same for local:cobalt on BlueGene, I do >>> not get any worker logs. >>> >>> Any clues as to what could be the reason for not getting worker >>> logs on BlueGene? Compute nodes can write to >>> workerLoggingDirectory so I do not thing it is an issue. 
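The sites.xml lines quoted in this thread lost their XML tags to the list's HTML scrubbing. Reconstructed, they were presumably coaster profile entries along these lines; the workerLoggingLevel key name for the DEBUG value is an inference from the surviving workerLoggingDirectory fragment, not something visible in the mail:

    <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile>
    <profile namespace="globus" key="workerLoggingDirectory">/tmp/workerlog</profile>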
>>> >>> Thanks, >>> Ketan >>> >>> >>> _______________________________________________ >>> Swift-devel mailing list >>> Swift-devel at ci.uchicago.edu >>> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel >> >> -- >> Michael Wilde >> Mathematics and Computer Science Computation Institute >> Argonne National Laboratory The University of Chicago >> >> >> _______________________________________________ >> Swift-devel mailing list >> Swift-devel at ci.uchicago.edu >> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > -- > Justin M Wozniak > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel -- Michael Wilde Mathematics and Computer Science Computation Institute Argonne National Laboratory The University of Chicago -------------- next part -------------- An HTML attachment was scrubbed... URL: From davidkelly at uchicago.edu Mon Sep 22 13:50:10 2014 From: davidkelly at uchicago.edu (David Kelly) Date: Mon, 22 Sep 2014 13:50:10 -0500 Subject: [Swift-devel] Handling failures with job directory creation Message-ID: When running psims on Midway, we set our scratch directory set to /scratch/local (a local disk mounted on each node). Occasionally /scratch/local gets full or becomes unmounted. When this happens, jobs are quickly and repeatedly sent to this bad node and get marked as failed. Here are some ideas about how Swift could handle this better: The Swift/swiftwrap error messages don't identify which node the directory creation failed on, which makes it difficult to report these errors to cluster admins. If swiftwrap fails to create a job directory, the node could get marked as 'bad' and prevent jobs from running there. An alternative would be to have a rule says, if using more than one node, never re-run a failed task on the same node. It could still be possible for a task to hit multiple bad nodes, but much less likely. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Sep 23 18:09:00 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 23 Sep 2014 16:09:00 -0700 Subject: [Swift-devel] Handling failures with job directory creation In-Reply-To: References: Message-ID: <1411513740.5958.0.camel@echo> Is this with coasters? Mihael On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote: > When running psims on Midway, we set our scratch directory set to > /scratch/local (a local disk mounted on each node). Occasionally > /scratch/local gets full or becomes unmounted. When this happens, jobs are > quickly and repeatedly sent to this bad node and get marked as failed. > > Here are some ideas about how Swift could handle this better: > > The Swift/swiftwrap error messages don't identify which node the directory > creation failed on, which makes it difficult to report these errors to > cluster admins. > > If swiftwrap fails to create a job directory, the node could get marked as > 'bad' and prevent jobs from running there. > > An alternative would be to have a rule says, if using more than one node, > never re-run a failed task on the same node. It could still be possible for > a task to hit multiple bad nodes, but much less likely. 
> _______________________________________________ > Swift-devel mailing list > Swift-devel at ci.uchicago.edu > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel From davidkelly at uchicago.edu Tue Sep 23 19:52:15 2014 From: davidkelly at uchicago.edu (David Kelly) Date: Tue, 23 Sep 2014 19:52:15 -0500 Subject: [Swift-devel] Handling failures with job directory creation In-Reply-To: <1411513740.5958.0.camel@echo> References: <1411513740.5958.0.camel@echo> Message-ID: Yep, it's with coasters local:slurm On Tue, Sep 23, 2014 at 6:09 PM, Mihael Hategan wrote: > Is this with coasters? > > Mihael > > On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote: > > When running psims on Midway, we set our scratch directory set to > > /scratch/local (a local disk mounted on each node). Occasionally > > /scratch/local gets full or becomes unmounted. When this happens, jobs > are > > quickly and repeatedly sent to this bad node and get marked as failed. > > > > Here are some ideas about how Swift could handle this better: > > > > The Swift/swiftwrap error messages don't identify which node the > directory > > creation failed on, which makes it difficult to report these errors to > > cluster admins. > > > > If swiftwrap fails to create a job directory, the node could get marked > as > > 'bad' and prevent jobs from running there. > > > > An alternative would be to have a rule says, if using more than one node, > > never re-run a failed task on the same node. It could still be possible > for > > a task to hit multiple bad nodes, but much less likely. > > _______________________________________________ > > Swift-devel mailing list > > Swift-devel at ci.uchicago.edu > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hategan at mcs.anl.gov Tue Sep 23 20:37:33 2014 From: hategan at mcs.anl.gov (Mihael Hategan) Date: Tue, 23 Sep 2014 18:37:33 -0700 Subject: [Swift-devel] Handling failures with job directory creation In-Reply-To: References: <1411513740.5958.0.camel@echo> Message-ID: <1411522653.7472.7.camel@echo> Right. It's a known problem. There is currently a quality measure for nodes which I think depends on failure rate and workers with higher quality are picked first if available. But this does not prevent bad nodes from being used if no good nodes are available. We could do something similar to what the swift scheduler does, which is to blacklist bad nodes for a certain duration (an exponential back-off sort of thing). As for the _swiftwrap messages, please feel free to experiment with the info() sub. In trunk, the job-to-node mapping information should be in the log and the log tools do use it as far as I remember. Mihael On Tue, 2014-09-23 at 19:52 -0500, David Kelly wrote: > Yep, it's with coasters local:slurm > > On Tue, Sep 23, 2014 at 6:09 PM, Mihael Hategan wrote: > > > Is this with coasters? > > > > Mihael > > > > On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote: > > > When running psims on Midway, we set our scratch directory set to > > > /scratch/local (a local disk mounted on each node). Occasionally > > > /scratch/local gets full or becomes unmounted. When this happens, jobs > > are > > > quickly and repeatedly sent to this bad node and get marked as failed. 
> > > > > > Here are some ideas about how Swift could handle this better: > > > > > > The Swift/swiftwrap error messages don't identify which node the > > directory > > > creation failed on, which makes it difficult to report these errors to > > > cluster admins. > > > > > > If swiftwrap fails to create a job directory, the node could get marked > > as > > > 'bad' and prevent jobs from running there. > > > > > > An alternative would be to have a rule says, if using more than one node, > > > never re-run a failed task on the same node. It could still be possible > > for > > > a task to hit multiple bad nodes, but much less likely. > > > _______________________________________________ > > > Swift-devel mailing list > > > Swift-devel at ci.uchicago.edu > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > From davidkelly at uchicago.edu Tue Sep 23 20:44:36 2014 From: davidkelly at uchicago.edu (David Kelly) Date: Tue, 23 Sep 2014 20:44:36 -0500 Subject: [Swift-devel] Handling failures with job directory creation In-Reply-To: <1411522653.7472.7.camel@echo> References: <1411513740.5958.0.camel@echo> <1411522653.7472.7.camel@echo> Message-ID: Thanks, I'll file this as a ticket/future improvement item On Tue, Sep 23, 2014 at 8:37 PM, Mihael Hategan wrote: > Right. It's a known problem. > > There is currently a quality measure for nodes which I think depends on > failure rate and workers with higher quality are picked first if > available. But this does not prevent bad nodes from being used if no > good nodes are available. > > We could do something similar to what the swift scheduler does, which is > to blacklist bad nodes for a certain duration (an exponential back-off > sort of thing). > > As for the _swiftwrap messages, please feel free to experiment with the > info() sub. In trunk, the job-to-node mapping information should be in > the log and the log tools do use it as far as I remember. > > Mihael > > On Tue, 2014-09-23 at 19:52 -0500, David Kelly wrote: > > Yep, it's with coasters local:slurm > > > > On Tue, Sep 23, 2014 at 6:09 PM, Mihael Hategan > wrote: > > > > > Is this with coasters? > > > > > > Mihael > > > > > > On Mon, 2014-09-22 at 13:50 -0500, David Kelly wrote: > > > > When running psims on Midway, we set our scratch directory set to > > > > /scratch/local (a local disk mounted on each node). Occasionally > > > > /scratch/local gets full or becomes unmounted. When this happens, > jobs > > > are > > > > quickly and repeatedly sent to this bad node and get marked as > failed. > > > > > > > > Here are some ideas about how Swift could handle this better: > > > > > > > > The Swift/swiftwrap error messages don't identify which node the > > > directory > > > > creation failed on, which makes it difficult to report these errors > to > > > > cluster admins. > > > > > > > > If swiftwrap fails to create a job directory, the node could get > marked > > > as > > > > 'bad' and prevent jobs from running there. > > > > > > > > An alternative would be to have a rule says, if using more than one > node, > > > > never re-run a failed task on the same node. It could still be > possible > > > for > > > > a task to hit multiple bad nodes, but much less likely. > > > > _______________________________________________ > > > > Swift-devel mailing list > > > > Swift-devel at ci.uchicago.edu > > > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
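Tying together the two points above (failure messages that do not say which node they came from, and experimenting with _swiftwrap's info() sub): the low-effort fix is to include the node's hostname wherever the wrapper reports a failed job-directory creation. A sketch only; the variable name and the exact reporting path are assumptions about _swiftwrap's internals, not its actual code:

    # inside _swiftwrap, at the point where the job directory is created (names assumed)
    if ! mkdir -p "$jobdir"; then
        echo "failed to create job directory $jobdir on host $(hostname)" >&2
        exit 254
    fi

The same $(hostname) could just as well be appended to the info() output; the point is simply that any per-node failure message should name the node so it can be reported to the cluster admins.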