[Swift-user] Data transfer error
Bronevetsky, Greg
bronevetsky1 at llnl.gov
Fri May 23 14:52:20 CDT 2014
Looking deeper through the logs, I noticed the following message in my *.info files:
"Job directory mode is: link on shared filesystem"
Googling around, I noticed that another mode is "local copy". Would running in this mode alleviate the pressure on the global scratch file system? If so, how do I use it? Thanks!
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com
-----Original Message-----
From: swift-user-bounces at ci.uchicago.edu [mailto:swift-user-bounces at ci.uchicago.edu] On Behalf Of Bronevetsky, Greg
Sent: Friday, May 23, 2014 12:33 PM
To: Mihael Hategan
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Data transfer error
I've now had a little more experience with this and have a partial workaround. Whatever the underlying cause, the problem happens much less often when I disable my mechanism for avoiding re-execution of tasks that have already completed. My current guess for the root cause is that I'm hitting the Lustre metadata servers too hard and they're throwing back occasional errors. Specifically, I just got yelled at by our admins for performing thousands of file opens per second.
I just did a small run and got some failures, e.g.:
Progress: time: Fri, 23 May 2014 12:25:54 -0700 Selecting site:2723 Submitted:216 Active:119 Stage out:16 Finished successfully:58 Failed but can retry:144
However, when I looked at the log files generated with workerLoggingLevel set to DEBUG, as well as at the stdout and stderr of the SLURM scripts, I didn't find any failures or errors. What should I be looking for?
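(For reference, the worker logging mentioned above is the sites.xml setting Mihael describes further down in this thread. A minimal sketch of where that profile entry sits in a pool definition is below; the pool handle, provider/jobmanager, and work directory are placeholders rather than the exact configuration used for these runs.)

  <config>
    <pool handle="pbatch">
      <execution provider="coaster" jobmanager="local:slurm"/>
      <filesystem provider="local"/>
      <workdirectory>/path/to/swiftwork</workdirectory>
      <!-- enables the per-worker debug logs written to ~/.globus/coasters/ -->
      <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile>
    </pool>
  </config>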
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com
-----Original Message-----
From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Friday, May 23, 2014 12:17 PM
To: Bronevetsky, Greg
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Data transfer error
From this run, do you happen to have a log called "worker-0522-0109210-000001.log"?
The missing exit code file errors, combined with a missing log from a worker, could indicate that the node this is running on does not have the home directory mounted properly.
That block seems to fail pretty quickly without running any jobs, so I suspect something in its environment isn't quite right, though bad nodes can be a little hard to track down.
It may also be helpful to disable lazy errors until you get things to run reliably.
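For reference, that amounts to either dropping the -lazy.errors flag used for the earlier runs or passing it explicitly as false. A minimal command-line sketch, using the conventional sites/tc file names and a placeholder script name, would be:

  swift -sites.file sites.xml -tc.file tc.data -lazy.errors false experiments.swift

The same setting can also be made persistent as lazy.errors=false in swift.properties.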
Mihael
On Thu, 2014-05-22 at 16:11 +0000, Bronevetsky, Greg wrote:
> There are a number of places. I think this got improved a bit post 0.94, but that's another story.
>
>
>
> Anyway, first place is the swift log (<scriptName>-<runId>.log in the directory where you ran swift).
>
> I'm not seeing much here. There are periodic progress messages like:
>
> 2014-05-22 08:53:50,494-0700 INFO RuntimeStats$ProgressTicker Selecting site:39311 Stage in:1 Submitting:1 Stage out:284 Finished successfully:3118 Failed but can retry:646
>
> And occasional errors like:
>
> Block Block task status changed: Failed Exitcode file (/g/g15/bronevet/.globus/scripts/PBS7360428055973706159.submit.exitcode) not found 5 queue polls after the job was reported done
>
> However, I can’t see the file in question at the reported path.
>
>
>
> Also, when the run finally fails (I ran with -lazy.errors true to keep it going as far as possible) I get the following:
>
> Exception in runModel:
>
> Arguments: [--solver, bicg, --precond, diag, --matrix, nasa1824, --num_runs, 100, --modelType, contModel, --faultModel, n, --locModel, local, --ap, 1e-2, --am, 1e-4, --sp, 1e-2, --sm, 1e-4, --dp, 1e-2, --dm, 1e-4, --mp, 1e-2, --mm, 1e-4, --psp, 1e-2, --psm, 1e-2, --ptsp, 1e-2, --ptsm, 1e-2, --cprob, 1e-5, --exec_time, 5.591000e-03, --stats, modelBlocks/stats.solver_bicg.precond_diag.mtx_nasa1824.mt_contModel.fm_n.lm_local.ap_1e-2.am_1e-4.sp_1e-2.sm_1e-4.dp_1e-2.dm_1e-4.mp_1e-2.mm_1e-4.psp_1e-2.psm_1e-2.ptsp_1e-2.ptsm_1e-2.cprob_1e-5.block_26]
>
> Host: pbatch
>
> Directory: experiments.new-20140522-0841-bhu0vze3/jobs/v/runModel-vvuy90rl
>
> Caused by: Block task failed: 0522-4108580-000002 Block task ended prematurely
>
>
>
> I’ve attached the log.
>
>
>
> The second place, if the previous one fails, is ~/.globus/coasters/*.log.
>
> ~/.globus/coasters contains the following files. No logs in my install.
>
> cscript1601720472000314596.pl  cscript7039282452425599503.pl  cscript8162919165195912014.pl
> cscript3466700121560325070.pl  cscript747960757439884021.pl   cscript8876053012113700888.pl
> cscript6877638344534390867.pl  cscript7841839259853776419.pl  cscript95537409038396166.pl
>
>
>
> There is yet another place that isn't enabled by default. That's the coaster worker logs. It can be enabled by saying <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile> in sites.xml. It will produce some additional logs in ~/.globus/coasters/.
>
> Done and attached. Please let me know if you see anything.
>
>
>
> Greg Bronevetsky
>
> Lawrence Livermore National Lab
>
> (925) 424-5756
>
> bronevetsky at llnl.gov
>
> http://greg.bronevetsky.com
>
>
>
>
>
> -----Original Message-----
> From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
> Sent: Thursday, May 22, 2014 12:23 AM
> To: Bronevetsky, Greg
> Cc: swift-user at ci.uchicago.edu
> Subject: Re: [Swift-user] Data transfer error
>
>
>
> There are a number of places. I think this got improved a bit post 0.94, but that's another story.
>
>
>
> Anyway, first place is the swift log (<scriptName>-<runId>.log in the directory where you ran swift).
>
>
>
> The second place, if the previous one fails, is ~/.globus/coasters/*.log.
>
>
>
> There is yet another place that isn't enabled by default. That's the coaster worker logs. It can be enabled by saying <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile> in sites.xml. It will produce some additional logs in ~/.globus/coasters/.
>
>
>
> Please feel free to send any/all these our way. We might be able to quickly spot some obvious problems.
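As a rough illustration, once those logs exist, one way to scan all three locations for the failure strings seen in this thread (the swift log name below is only a placeholder following the <scriptName>-<runId>.log pattern) is:

  grep -iE "exception|failed|prematurely" experiments.new-20140522-0841-bhu0vze3.log
  grep -iE "exception|failed|prematurely" ~/.globus/coasters/*.log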
>
>
>
> Mihael
>
>
>
> On Wed, 2014-05-21 at 23:59 +0000, Bronevetsky, Greg wrote:
>
> > Where should I look to debug the following error?
>
> > Caused by: Block task failed: 0521-5404270-000009 Block task ended prematurely
>
> >
>
> > Greg Bronevetsky
>
> > Lawrence Livermore National Lab
>
> > (925) 424-5756
>
> > bronevetsky at llnl.gov
>
> > http://greg.bronevetsky.com
>
> >
>
> >
>
> > -----Original Message-----
>
> > From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
>
> > Sent: Wednesday, May 21, 2014 2:10 PM
>
> > To: Bronevetsky, Greg
>
> > Cc: swift-user at ci.uchicago.edu
>
> > Subject: Re: [Swift-user] Data transfer error
>
> >
>
> > Hi,
>
> >
>
> > Sorry for the late reply (to your previous mail mentioning this).
>
> >
>
> > I don't know what the answer to your question is. It shouldn't be happening.
>
> >
>
> > However, a directory called <scriptName>-<timestamp>-<runid>.d should be created by swift. That directory should contain one or more *.info files, which may contain a few more details.
>
> >
>
> > Mihael
>
> >
>
> > On Wed, 2014-05-21 at 20:50 +0000, Bronevetsky, Greg wrote:
>
> > > Related question: what causes the following error?
>
> > > Caused by: Failed to move output file solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/block_10 to shared directory
>
> > > I see the file in the swift work directory, and the path solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/ exists in the directory where I run the script.
>
> > >
>
> > > Greg Bronevetsky
>
> > > Lawrence Livermore National Lab
>
> > > (925) 424-5756
>
> > > bronevetsky at llnl.gov
>
> > > http://greg.bronevetsky.com
>
> > >
>
> > > From: Bronevetsky, Greg
>
> > > Sent: Tuesday, May 20, 2014 2:11 PM
>
> > > To: swift-user at ci.uchicago.edu
>
> > > Subject: Data transfer error
>
> > >
>
> > > I sometimes get the following error in my Swift runs:
>
> > > Caused by: Failed to move output file solver_bicg.precond_diag.mtx_nasa1824/mt_fmodel/fm_n/lm_local/allStats to shared directory
>
> > > What causes it and how can I avoid it?
>
> > >
>
> > > Greg Bronevetsky
>
> > > Lawrence Livermore National Lab
>
> > > (925) 424-5756
>
> > > bronevetsky at llnl.gov
>
> > > http://greg.bronevetsky.com
>
> > >
>
> > > _______________________________________________
>
> > > Swift-user mailing list
>
> > > Swift-user at ci.uchicago.edu
>
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>
> >
>
> >
>
>
>
>
_______________________________________________
Swift-user mailing list
Swift-user at ci.uchicago.edu
https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user