[Swift-devel] Unexpected messages from coasters on BG/P

Michael Wilde wilde at mcs.anl.gov
Wed Nov 11 13:06:44 CST 2009



On 11/11/09 12:40 PM, Mihael Hategan wrote:
> On Wed, 2009-11-11 at 10:38 -0600, Michael Wilde wrote:
>> Mihael, can you tell me what the messages below mean?
>>
>> - the block ended prematurely message
> 
> That says that the block job completed before being commanded to shut
> down. It's very likely that workers didn't even get started. It usually
> indicates a problem with the queue parameters (maybe you forgot
> kernel=zeptoos), but it's hard to tell without looking at cobalt logs.
> It is also not a problem that cqsub would complain about, since this
> only happens when the job is successfully queued.

Here is my sites.xml pool element:

   <pool handle="surveyor">
     <filesystem provider="local" />
     <execution provider="coaster" jobmanager="local:cobalt"/>
     <profile namespace="globus" key="slots">10</profile>
     <profile namespace="globus" key="nodeGranularity">64</profile>
     <profile namespace="globus" key="workersPerNode">4</profile>
     <profile namespace="globus" key="maxNodes">64</profile>
     <profile namespace="globus" key="project">HTCScienceApps</profile>
     <profile namespace="globus" key="kernelprofile">zeptoos</profile>
     <profile namespace="globus" key="maxtime">3000</profile>
     <profile namespace="globus" key="alcfbgpnat">true</profile>
     <profile namespace="karajan" key="initialScore">100000</profile>
     <workdirectory>/home/wilde/swiftwork</workdirectory>
     <scratch>/scratch</scratch>
   </pool>

I copied it from one you posted and have not yet tuned it for a small test.

> 
>> - the long Java tracebacks (seems like one for each of the 256 jobs?)
> 
> That tells you that the coaster provider doesn't yet implement job
> canceling. Normally, this doesn't pop up. But if you have replication
> enabled, and jobs get a chance to get replicated, you will see these
> when the copies start to run.

I'll do my next runs with replication and retries turned off, and try to
shrink and reproduce the problem with less noise.
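
Roughly, that means something like the following in swift.properties. This
is only a sketch: it assumes the usual property names (replication.enabled
and execution.retries), which I have not checked against this build.

   # swift.properties fragment (key names assumed, not verified here)
   # turn off job replication, which triggers the cancel tracebacks
   replication.enabled=false
   # do not retry failed jobs, so each failure shows up only once
   execution.retries=0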

Thanks,

Mike


> 
> You should disable replication. It's useless if only running on the
> BG/P. In fact, the system should disable it automatically for
> applications that are only present on one site.
> 


