[Swift-devel] Unexpected messages from coasters on BG/P
Michael Wilde
wilde at mcs.anl.gov
Wed Nov 11 13:06:44 CST 2009
On 11/11/09 12:40 PM, Mihael Hategan wrote:
> On Wed, 2009-11-11 at 10:38 -0600, Michael Wilde wrote:
>> Mihael, can you tell me what the messages below mean?
>>
>> - the block ended prematurely message
>
> That says that the block job completed before being commanded to shut
> down. It's very likely that workers didn't even get started. It usually
> indicates a problem with the queue parameters (maybe you forgot
> kernel=zeptoos), but it's hard to tell without looking at cobalt logs.
> It is also not a problem that cqsub would complain about, since this
> only happens when the job is successfully queued.
Here is my sites.xml pool element:

  <pool handle="surveyor">
    <filesystem provider="local" />
    <execution provider="coaster" jobmanager="local:cobalt"/>
    <profile namespace="globus" key="slots">10</profile>
    <profile namespace="globus" key="nodeGranularity">64</profile>
    <profile namespace="globus" key="workersPerNode">4</profile>
    <profile namespace="globus" key="maxNodes">64</profile>
    <profile namespace="globus" key="project">HTCScienceApps</profile>
    <profile namespace="globus" key="kernelprofile">zeptoos</profile>
    <profile namespace="globus" key="maxtime">3000</profile>
    <profile namespace="globus" key="alcfbgpnat">true</profile>
    <profile namespace="karajan" key="initialScore">100000</profile>
    <workdirectory>/home/wilde/swiftwork</workdirectory>
    <scratch>/scratch</scratch>
  </pool>

I copied it from one you posted and have not yet tuned it for a small test.
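
As a rough sanity check on the sizes involved, here is a minimal Python sketch that reads the globus profile values out of a trimmed copy of the pool element above and multiplies them out. It assumes (not confirmed in this thread) that one block can hold up to maxNodes nodes with workersPerNode workers each, and that slots caps the number of blocks queued at once. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the pool element above; only the sizing-related profiles are kept.
POOL_XML = """
<pool handle="surveyor">
  <profile namespace="globus" key="slots">10</profile>
  <profile namespace="globus" key="nodeGranularity">64</profile>
  <profile namespace="globus" key="workersPerNode">4</profile>
  <profile namespace="globus" key="maxNodes">64</profile>
</pool>
"""


def globus_profiles(pool_xml):
    """Return the globus-namespace profile entries as a {key: text} dict."""
    root = ET.fromstring(pool_xml)
    return {
        p.get("key"): (p.text or "").strip()
        for p in root.findall("profile")
        if p.get("namespace") == "globus"
    }


def max_workers_per_block(profiles):
    """Assumption: a block holds at most maxNodes nodes, each running workersPerNode workers."""
    return int(profiles["maxNodes"]) * int(profiles["workersPerNode"])


def max_workers_total(profiles):
    """Assumption: slots is the maximum number of blocks queued at the same time."""
    return int(profiles["slots"]) * max_workers_per_block(profiles)


profiles = globus_profiles(POOL_XML)
workers_per_block = max_workers_per_block(profiles)
workers_total = max_workers_total(profiles)
print("workers per block:", workers_per_block)
print("workers across all slots:", workers_total)
```

Under those assumptions a single full block works out to 64 x 4 = 256 workers, which lines up with the 256 jobs I mentioned, so for a small test I would want to cut slots and maxNodes well down.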
>
>> - the long java tracebacks (seems like one for each of the 256 jobs?)
>
> That tells you that the coaster provider doesn't yet implement job
> canceling. Normally, this doesn't pop up. But if you have replication
> enabled, and jobs get a chance to get replicated, you will see these
> when the copies start to run.
I'll do my next runs with replication and retries turned off, and try to
shrink the test and reproduce the problem with less noise.
Thanks,
Mike
>
> You should disable replication. It's useless if only running on the
> BG/P. In fact, the system should disable it automatically for
> applications that are only present on one site.
>