[Swift-devel] SGE provider error parsing qstat output

Michael Wilde wilde@mcs.anl.gov
Sun Nov 14 22:42:14 CST 2010


was: Re: Provenance DB Diagrams

Hi David,

The first bug seems very familiar; I thought Mihael had fixed it once.

qstat was giving slightly different output on older versions (e.g., I think sisboombah) than on later ones (e.g., Ranger).

Perhaps this is a different manifestation of the same kind of command-output format variation? Feel free to dig into the code. Do you have access to an SGE system where this works? (Let me know if you need access to the UC IBI Cluster; also try godzilla or Ranger.)
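
If the parse failure is just column-count drift between qstat versions, one option is to tokenize on whitespace and read only the leading fixed columns instead of matching the whole line. A minimal sketch, not the actual SGEExecutor parse (the class name and field positions here are my assumptions):

public class QstatLineSketch {
    public static void main(String[] args) {
        // The qstat line from David's report; the trailing column count
        // varies across SGE versions, so ignore whatever trails behind.
        String line = "623018 0.55500 SteadyShea xinliang r "
                    + "11/13/2010 14:09:05 all.q@node1 1";
        String[] f = line.trim().split("\\s+");
        if (f.length >= 5) {
            // Columns 0 and 4 are job id and state in both formats.
            System.out.println("job " + f[0] + " state " + f[4]);
        }
    }
}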

Regarding the coaster error: what's happening here is that the PE is not being passed from the coaster pool attributes to the attributes of the SGE jobs that the coaster provider creates.

Do you have access to Ranger? I have a fix for this there that needs to be tested and integrated into trunk.

Marc Parisien of UChicago IBD is trying to run coasters on the IBI cluster and is getting the same error. If you could find and integrate the fix, that would be great.

I attach my mods from Ranger. I think the mods related to "coresPerNode" can be removed, as Mihael's PPN fix hopefully addresses them. What's needed is just the code that passes the PE setting from the coasters pool to the job specs of the SGE jobs it creates.

My svn diff is below and modified files are attached.

This was done on the stable branch, but the SGE provider has since been moved to trunk.

You should either get guidance from Mihael on this, or discuss it with him if you'd rather he make the fix.

From svn status:

M      modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java
M      modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Settings.java
M      modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockTask.java

I attach these three files. Just look for the changes that propagate the PE setting.

Ignore the coresPerNode changes.

There were also changes to ensure that the provider starts the right number of workers per node: there should now always be a single copy of worker.pl per node, with its parallelism controlled by workersPerNode.

The PPN setting should ensure that the job gets the expected number of cores allocated on systems that share nodes between jobs.
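
To make the arithmetic concrete, here is a minimal sketch of what the patched preamble logic emits (the values are hypothetical, and this mirrors rather than reuses the provider code):

public class PePreambleSketch {
    public static void main(String[] args) {
        int count = 2;          // "count" job attribute, i.e. nodes (assumed)
        int workersPerNode = 4; // "workerspernode" attribute (assumed)
        String pe = "threaded"; // "pe" attribute from the pool profile
        // Mirrors the writeAttrValue() call in the patched SGEExecutor:
        System.out.print("#$ -pe " + pe + " "
                         + (count * workersPerNode) + "\n");
        // prints: #$ -pe threaded 8
    }
}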

From svn diff (lots of junk below from my experiments):

login3$ svn diff
Index: modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java
===================================================================
--- modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java	(revision 2734)
+++ modules/provider-localscheduler/src/org/globus/cog/abstraction/impl/scheduler/sge/SGEExecutor.java	(working copy)
@@ -52,6 +52,10 @@
         writeAttr(attrName, arg, wr, null);
     }
 
+    protected void writeAttrValue(String value, String arg, Writer wr ) throws IOException {
+        wr.write("#$ " + arg + value + '\n');
+    }
+
     protected void writeWallTime(Writer wr) throws IOException {
         Object walltime = getSpec().getAttribute("maxwalltime");
         if (walltime != null) {
@@ -77,10 +81,32 @@
         wr.write("#$ -V\n");
         writeAttr("project", "-A ", wr);
 
-        writeAttr("count", "-pe "
+// FIXME: testing this change: MW
+
+        Object countValue = getSpec().getAttribute("count");
+	int count;
+
+        if (countValue != null)
+            count = Integer.valueOf(String.valueOf(countValue)).intValue();
+	else 
+	    count = 1;
+
+        // FIXME: wpn is only meaningful for coasters; is 1 ok otherwise?
+	// should we flag wpn as error if not coasters?
+
+	Object wpnValue = getAttribute(spec, "workerspernode", "1");
+	int wpn = Integer.valueOf(String.valueOf(wpnValue)).intValue();
+	logger.info("FETCH OF WPN: " + wpn); // FIXME: DB
+
+	count *= wpn;
+	logger.info("FETCH OF PE: " + getAttribute(spec, "pe", "NO pe"));
+	logger.info("FETCH OF CPN: " + getAttribute(spec, "corespernode", "NO cpn"));
+        writeAttrValue(String.valueOf(count), "-pe "
                 + getAttribute(spec, "pe", getSGEProperties().getDefaultPE())
-                + " ", wr, "1");
+                + " ", wr);
 
+// FIXME: END OF MW CHANGE
+
         writeWallTime(wr);
         writeAttr("queue", "-q ", wr);
         if (spec.getStdInput() != null) {
@@ -157,7 +183,8 @@
     
     protected void writeMultiJobPreamble(Writer wr, String exitcodefile)
             throws IOException {
-        wr.write("NODES=`cat $PE_HOSTFILE | awk '{ for(i=0;i<$2;i++){print $1} }'`\n");
+// FIXME:MW        wr.write("NODES=`cat $PE_HOSTFILE | awk '{ for(i=0;i<$2;i++){print $1} }'`\n");
+        wr.write("NODES=`cat $PE_HOSTFILE | awk '{ print $1 }'`\n");
         wr.write("ECF=" + exitcodefile + "\n");
         wr.write("INDEX=0\n");
         wr.write("for NODE in $NODES; do\n");
@@ -188,13 +215,21 @@
         return (Properties) getProperties();
     }
     
-    public static final Pattern JOB_ID_LINE = Pattern.compile("[Yy]our job (\\d+) \\(.*\\) has been submitted");
+     public static final Pattern JOB_ID_LINE = Pattern.compile(".*[Yy]our job (\\d+) \\(.*\\) has been submitted");
+    // public static final Pattern JOB_ID_LINE = Pattern.compile("[Yy]our job (\\d+) \\(.*\\) has been submitted");
+    // public static final Pattern JOB_ID_LINE = Pattern.compile("[Yy]our job \\([0-9]\\+\\) .* has been submitted");
 
     protected String parseSubmitCommandOutput(String out) throws IOException {
         // > your job 2494189 ("t1.sub") has been submitted
         BufferedReader br = new BufferedReader(new CharArrayReader(out.toCharArray()));
         String line = br.readLine();
+	if (logger.isInfoEnabled()) {
+	    logger.info("parseSubmitCommandOutput: out=" + out);
+	}
         while (line != null) {
+	    if (logger.isInfoEnabled()) {
+		logger.info("parseSubmitCommandOutput: line=" + line);
+	    }
             Matcher m = JOB_ID_LINE.matcher(line);
             if (m.matches()) {
                 String id = m.group(1);
Index: modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Settings.java
===================================================================
--- modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Settings.java	(revision 2734)
+++ modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/Settings.java	(working copy)
@@ -22,16 +22,17 @@
     public static final Logger logger = Logger.getLogger(Settings.class);
 
     public static final String[] NAMES =
-            new String[] { "slots", "workersPerNode", "nodeGranularity", "allocationStepSize",
+            new String[] { "slots", "workersPerNode", "coresPerNode", "nodeGranularity", "allocationStepSize",
                     "maxNodes", "lowOverallocation", "highOverallocation",
                     "overallocationDecayFactor", "spread", "reserve", "maxtime", "project",
-                    "queue", "remoteMonitorEnabled", "kernelprofile", "alcfbgpnat", "internalHostname" };
+                    "queue", "remoteMonitorEnabled", "kernelprofile", "alcfbgpnat", "internalHostname", "pe" };
 
     /**
      * The maximum number of blocks that can be active at one time
      */
     private int slots = 20;
     private int workersPerNode = 1;
+    private int coresPerNode = 1;
     /**
      * How many nodes to allocate at once
      */
@@ -90,6 +91,8 @@
 
     private String queue;
 
+    private String pe;
+
     private String kernelprofile;
 
     private boolean alcfbgpnat;
@@ -116,6 +119,14 @@
         this.workersPerNode = workersPerNode;
     }
 
+    public int getCoresPerNode() {
+        return coresPerNode;
+    }
+
+    public void setCoresPerNode(int coresPerNode) {
+        this.coresPerNode = coresPerNode;
+    }
+
     public int getNodeGranularity() {
         return nodeGranularity;
     }
@@ -273,6 +284,14 @@
         this.queue = queue;
     }
 
+    public String getPe() {
+        return pe;
+    }
+
+    public void setPe(String pe) {
+        this.pe = pe;
+    }
+
     public boolean getRemoteMonitorEnabled() {
         return remoteMonitorEnabled;
     }
Index: modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockTask.java
===================================================================
--- modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockTask.java	(revision 2734)
+++ modules/provider-coaster/src/org/globus/cog/abstraction/coaster/service/job/manager/BlockTask.java	(working copy)
@@ -40,6 +40,13 @@
         setAttribute(spec, "maxwalltime", WallTime.format((int) block.getWalltime().getSeconds()));
         setAttribute(spec, "queue", settings.getQueue());
         setAttribute(spec, "project", settings.getProject());
+
+	// added - mw:
+        setAttribute(spec, "coresPerNode", String.valueOf(settings.getCoresPerNode()));
+        setAttribute(spec, "workersPerNode", String.valueOf(settings.getWorkersPerNode()));
+        setAttribute(spec, "pe", settings.getPe());
+	// end additions - mw ^^^
+
         int count = block.getWorkerCount() / settings.getWorkersPerNode();
         if (count > 1) {
             setAttribute(spec, "jobType", "multiple");
login3$ 
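
Incidentally, the JOB_ID_LINE change above just prepends ".*" so that Matcher.matches() tolerates leading text before "your job ...". A quick standalone check, using the sample qsub output from the comment in parseSubmitCommandOutput:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JobIdPatternCheck {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(
            ".*[Yy]our job (\\d+) \\(.*\\) has been submitted");
        Matcher m = p.matcher(
            "your job 2494189 (\"t1.sub\") has been submitted");
        if (m.matches()) {
            System.out.println("job id = " + m.group(1)); // 2494189
        }
    }
}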

--

- Mike

----- Original Message -----
> Hello Mike,
> 
> I am working on adding more tests to the automated test suite. I am
> running into some issues when trying to run swift with SGE on
> sisboombah. The tests I wrote are based on the example configurations
> you sent out to the list earlier. Here is what is happening.
> 
> I am running the SGE local test with the following config file:
> 
> <config>
> <pool handle="sge-local">
> <execution provider="sge" url="none" />
> <profile namespace="globus" key="pe">threaded</profile>
> <profile key="jobThrottle" namespace="karajan">.49</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <filesystem provider="local" url="none" />
> <workdirectory>/home/dk0966/swiftwork</workdirectory>
> </pool>
> </config>
> 
> The error I am seeing is:
> Caused by: java.io.IOException: Failed to parse qstat line:
> 623018 0.55500 SteadyShea xinliang r 11/13/2010 14:09:05 all.q@node1 1
> 
> The next test I try on this machine is SGE-coasters with the following
> config:
> 
> <config>
> <pool handle="sge-coasters">
> <execution provider="coaster" url="none" jobmanager="local:sge"/>
> <profile namespace="globus" key="pe">threaded</profile>
> <profile namespace="globus" key="workersPerNode">4</profile>
> <profile namespace="globus" key="slots">128</profile>
> <profile namespace="globus" key="nodeGranularity">1</profile>
> <profile namespace="globus" key="maxnodes">1</profile>
> <profile namespace="karajan" key="jobThrottle">5.11</profile>
> <profile namespace="karajan" key="initialScore">10000</profile>
> <filesystem provider="local" url="none"/>
> <workdirectory>/home/dk0966/swiftwork</workdirectory>
> </pool>
> </config>
> 
> For which I get the following error:
> Worker task failed: Error submitting block task
> org.globus.cog.abstraction.impl.common.task.TaskSubmissionException:
> Cannot submit job: Could not submit job (qsub reported an exit code of
> 1).
> Unable to run job: job rejected: the requested parallel environment
> "16way" does not exist.Exiting.
> 
> I couldn't find much information about SGE setups either in the guide
> or the cookbook. Is there anything else I am missing to get this up
> and running?
> 
> Thanks,
> David

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sgecoastermods.tar
Type: application/x-tar
Size: 30720 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/swift-devel/attachments/20101114/6f7bb14c/attachment.tar>

