<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"\@Adobe Song Std L";
panose-1:0 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.MsoPlainText, li.MsoPlainText, div.MsoPlainText
{mso-style-priority:99;
mso-style-link:"Plain Text Char";
margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
span.PlainTextChar
{mso-style-name:"Plain Text Char";
mso-style-priority:99;
mso-style-link:"Plain Text";
font-family:"Calibri","sans-serif";}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri","sans-serif";}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoPlainText">Mihael, I've been struggling with the runs for the past few days. I've managed to push some of them through, but most hit so many errors that they appear to stall. Below is an example of Swift's stdout:<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">RunID: 20140527-1533-6dt18a4b<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">Progress: time: Tue, 27 May 2014 15:33:52 -0700<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">Progress: time: Tue, 27 May 2014 15:33:54 -0700 Stage in:1 Submitted:2<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">Progress: time: Tue, 27 May 2014 15:33:55 -0700 Active:2 Finished successfully:1<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">Progress: time: Tue, 27 May 2014 15:33:56 -0700 Initializing:40 Active:2 Finished successfully:1<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">Progress: time: Tue, 27 May 2014 15:33:57 -0700 Initializing:352 Selecting site:241 Active:2 Finished successfully:1<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">Progress: time: Tue, 27 May 2014 15:33:58 -0700 Selecting site:1674 Submitting:326 Active:2 Finished successfully:1<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">Progress: time: Tue, 27 May 2014 15:34:00 -0700 Selecting site:1601 Stage in:2 Submitted:397 Active:2 Finished successfully:1<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">...<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">Progress: time: Tue, 27 May 2014 15:39:42 -0700 Selecting site:1268 Stage in:30 Submitted:328 Active:31 Finished successfully:3 Failed but can retry:344<o:p></o:p></p>
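<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">To watch how the "Failed but can retry" count grows over a run, the Progress lines can be tallied with a small script. This is only a rough sketch: the line layout is inferred from the sample output above rather than from any Swift documentation, and the names parse_progress and retry_counts are my own, not part of Swift.<o:p></o:p></p>
<pre>
```python
import re

# Assumed layout, inferred only from the sample lines above:
#   "Progress: time: DATE ZONE State one:N State two:M ..."
PROGRESS_RE = re.compile(r"^Progress:\s+time:\s+(.*?[+-]\d{4})\s*(.*)$")
STATE_RE = re.compile(r"([A-Za-z][A-Za-z ]*?):(\d+)")


def parse_progress(line):
    """Return (timestamp, {state: count}), or None for non-Progress lines."""
    match = PROGRESS_RE.match(line.strip())
    if match is None:
        return None
    timestamp, rest = match.groups()
    counts = {name.strip(): int(num) for name, num in STATE_RE.findall(rest)}
    return timestamp, counts


def retry_counts(lines):
    """Yield (timestamp, count) of "Failed but can retry" per Progress line."""
    for line in lines:
        parsed = parse_progress(line)
        if parsed is not None:
            timestamp, counts = parsed
            # A state that is absent from a line is treated as zero.
            yield timestamp, counts.get("Failed but can retry", 0)
```
</pre>
<p class="MsoPlainText">Feeding the captured stdout to retry_counts line by line gives one (timestamp, count) pair per Progress line, which makes it easier to see when the failures start piling up.<o:p></o:p></p>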
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">Digging through the logs, I've found the following mention of an error in one of my worker logs (attached):<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">2014/05/27 15:34:56.648 INFO 000000 1401230034461 Staging out /p/lscratche/bronevet/swift/work/psuadeExperiments-20140527-1533-6dt18a4b/jobs/1/runModel-1r0az8rl/wrapper.error (mode = 2).<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">…<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">2014/05/27 15:34:56.659 INFO 000000 1401230034458 Job dir total 17<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">drwx------ 2 bronevet bronevet 7168 May 27 15:34 .<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">drwx------ 5 bronevet bronevet 7168 May 27 15:34 ..<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">-rw------- 1 bronevet bronevet 6078 May 27 15:34 _swiftwrap.staging<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">-rw------- 1 bronevet bronevet 96 May 27 15:34 out.expID_0<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">-rw------- 1 bronevet bronevet 1199 May 27 15:34 out.solver_bicg.precond_diag.mtx_nasa1824.mt_0.fm_0.lm_0.ap_-1.5515515515515510e-01.am_-2.1171171171171173e+00.psp_-3.0330330330330328e+00.psm_-2.2472472472472473e+00.cprob_1e-10.block_0<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">-rw------- 1 bronevet bronevet 0 May 27 15:34 stderr.txt<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">-rw------- 1 bronevet bronevet 103 May 27 15:34 wrapper.error<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">-rw------- 1 bronevet bronevet 32501 May 27 15:34 wrapper.log<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">Also, I saw some SLURM stdout files that said I’m out of space.<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">cat /g/g15/bronevet/.globus/scripts/Slurm1575966868019932992.submit.stdout<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">env: write error: No space left on device<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">df: write error: No space left on device<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">cat: write error: No space left on device<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">cat: write error: No space left on device<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">_swiftwrap.staging: line 45: echo: write error: No space left on device<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">…<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">env: write error: No space left on device<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">df: write error: No space left on device<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">cat: write error: No space left on device<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">cat: write error: No space left on device<o:p></o:p></p>
<p class="MsoPlainText" style="margin-left:.5in">_swiftwrap.staging: line 45: echo: write error: No space left on device<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">However, I can’t see how this could be, since each node has 16 GB of RAM (available as RAM or ramdisk). Is there a way to look into this further?<o:p></o:p></p>
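<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">One thing I can check myself is which filesystem is actually full, and whether it is out of blocks or out of inodes, since "No space left on device" can be reported in either case. Below is a minimal sketch using Python's os.statvfs (Unix only). The function name space_report and the paths in the loop are placeholders of mine. The real candidates would be whatever directories the jobs and the SLURM submit scripts write to.<o:p></o:p></p>
<pre>
```python
import os


def space_report(path):
    """Return free/total bytes and inodes for the filesystem holding path."""
    st = os.statvfs(path)
    return {
        "path": path,
        # f_bavail and f_favail count what an unprivileged user can still use.
        "free_bytes": st.f_bavail * st.f_frsize,
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_inodes": st.f_favail,
        "total_inodes": st.f_files,
    }


if __name__ == "__main__":
    # Placeholder paths (assumptions): substitute the job and script dirs.
    for candidate in (os.path.expanduser("~"), "/tmp"):
        if os.path.exists(candidate):
            report = space_report(candidate)
            print(
                "{path}: {free_bytes} of {total_bytes} bytes free, "
                "{free_inodes} of {total_inodes} inodes free".format(**report)
            )
```
</pre>
<p class="MsoPlainText">Some filesystems report zero for the inode totals, so a zero in total_inodes is not necessarily a sign of trouble. This also says nothing about per-user quotas, which would have to be checked separately.<o:p></o:p></p>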
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">Greg Bronevetsky<o:p></o:p></p>
<p class="MsoPlainText">Lawrence Livermore National Lab<o:p></o:p></p>
<p class="MsoPlainText">(925) 424-5756<o:p></o:p></p>
<p class="MsoPlainText">bronevetsky@llnl.gov<o:p></o:p></p>
<p class="MsoPlainText">http://greg.bronevetsky.com<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">-----Original Message-----<br>
From: Mihael Hategan [mailto:hategan@mcs.anl.gov] <br>
Sent: Friday, May 23, 2014 1:23 PM<br>
To: Bronevetsky, Greg<br>
Cc: swift-user@ci.uchicago.edu<br>
Subject: Re: [Swift-user] Data transfer error</p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">On Fri, 2014-05-23 at 19:32 +0000, Bronevetsky, Greg wrote:<o:p></o:p></p>
<p class="MsoPlainText">> I've now had a little more experience with this and have gotten a
<o:p></o:p></p>
<p class="MsoPlainText">> partial workaround. Whatever the underlying cause, it seems to happen
<o:p></o:p></p>
<p class="MsoPlainText">> a lot less when I disable my mechanisms to avoid re-executing tasks
<o:p></o:p></p>
<p class="MsoPlainText">> that I've already completed. Right now my guess for the root cause is
<o:p></o:p></p>
<p class="MsoPlainText">> that I'm hitting the Lustre meta-data servers too hard and they're
<o:p></o:p></p>
<p class="MsoPlainText">> throwing back occasional errors.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">That sounds plausible.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">> Specifically, I just got yelled at by our admins about performing
<o:p></o:p></p>
<p class="MsoPlainText">> thousands of file openings per second.<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">:)<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">> <o:p></o:p></p>
<p class="MsoPlainText">> I just did a small run and got some failures. e.g.:<o:p></o:p></p>
<p class="MsoPlainText">> Progress: time: Fri, 23 May 2014 12:25:54 -0700 Selecting site:2723
<o:p></o:p></p>
<p class="MsoPlainText">> Submitted:216 Active:119 Stage out:16 Finished successfully:58
<o:p></o:p></p>
<p class="MsoPlainText">> Failed but can retry:144<o:p></o:p></p>
<p class="MsoPlainText">> <o:p></o:p></p>
<p class="MsoPlainText">> However, when I looked at the log files generated when I set
<o:p></o:p></p>
<p class="MsoPlainText">> workerLoggingLevel to DEBUG as well as the stdout and stderr of the
<o:p></o:p></p>
<p class="MsoPlainText">> SLURM scripts I didn't find any failures or errors. What should I be
<o:p></o:p></p>
<p class="MsoPlainText">> looking for?<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">Those are probably swift-level errors, and the details would be in the swift log (or on stdout once the run has finished).<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
<p class="MsoPlainText">Mihael<o:p></o:p></p>
<p class="MsoPlainText"><o:p> </o:p></p>
</div>
</body>
</html>