[Swift-user] Data transfer error
Mihael Hategan
hategan at mcs.anl.gov
Fri May 23 14:17:16 CDT 2014
From this run, do you happen to have a log called
"worker-0522-0109210-000001.log"?
The missing exit code file errors combined with a missing log from a
worker could indicate that the node on which this is running may not
have the home directory mounted properly.
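
A minimal sketch of how one might test that hypothesis from a job running on the suspect node, assuming Python is available there. The function name check_home is hypothetical; the two subdirectories are the ones that appear in this thread (~/.globus/scripts for the exit code files, ~/.globus/coasters for the worker logs).

```python
import tempfile
from pathlib import Path


def check_home(home):
    """Return a list of problems found under the given home directory."""
    home = Path(home)
    if not home.is_dir():
        return [f"{home} does not exist or is not a directory"]
    problems = []
    # Directories mentioned in this thread; both live under the home directory.
    for sub in (".globus/scripts", ".globus/coasters"):
        if not (home / sub).is_dir():
            problems.append(f"{home / sub} is missing")
    # Try an actual write, since a read-only or stale mount can still list files.
    try:
        with tempfile.NamedTemporaryFile(dir=home, prefix=".swift-check-"):
            pass
    except OSError as exc:
        problems.append(f"cannot write to {home}: {exc}")
    return problems


if __name__ == "__main__":
    for line in check_home(Path.home()) or ["home directory looks usable"]:
        print(line)
```

An empty result only says the directory is present and writable at the moment of the check; it does not rule out a mount that comes and goes.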
Said block seems to fail pretty quickly without running any jobs, so I
suspect something in its environment isn't quite right, though bad nodes
may be a little hard to track down.
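
To see which blocks are affected, one option is to pull the relevant messages out of the swift log. The sketch below is only an illustration: the two patterns are taken from the error lines quoted later in this thread, the function name summarize is hypothetical, and other Swift versions may word these messages differently.

```python
import re
import sys
from collections import Counter

# Patterns based on the two messages quoted in this thread.
EXITCODE_RE = re.compile(r"Exitcode file \((?P<path>[^)]+)\) not found")
BLOCK_RE = re.compile(r"Block task failed: (?P<block>[\w-]+)")


def summarize(lines):
    """Collect missing exit code files and failed block IDs from log lines."""
    missing = []
    failed = Counter()
    for line in lines:
        m = EXITCODE_RE.search(line)
        if m:
            missing.append(m.group("path"))
        m = BLOCK_RE.search(line)
        if m:
            failed[m.group("block")] += 1
    return missing, failed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of the swift log as the first argument.
    with open(sys.argv[1], errors="replace") as log:
        missing, failed = summarize(log)
    print(f"{len(missing)} missing exit code file(s)")
    for block, count in failed.most_common():
        print(f"{block}: reported failed {count} time(s)")
```

If the same few block IDs keep showing up, matching them to the nodes they ran on may help narrow down a bad node.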
It may also be helpful to disable lazy errors until you get things to
run reliably.
Mihael
On Thu, 2014-05-22 at 16:11 +0000, Bronevetsky, Greg wrote:
> There are a number of places. I think this got improved a bit post 0.94, but that's another story.
>
>
>
> Anyway, first place is the swift log (<scriptName>-<runId>.log in the directory where you ran swift).
>
> I’m not seeing much here. There are the periodic warnings like:
>
> 2014-05-22 08:53:50,494-0700 INFO RuntimeStats$ProgressTicker Selecting site:39311 Stage in:1 Submitting:1 Stage out:284 Finished successfully:3118 Failed but can retry:646
>
> And occasional errors like:
>
> Block Block task status changed: Failed Exitcode file (/g/g15/bronevet/.globus/scripts/PBS7360428055973706159.submit.exitcode) not found 5 queue polls after the job was reported done
>
> However, I can’t see the file in question at the reported path.
>
>
>
> Also, when the run finally fails (I ran with -lazy.errors true to keep it going as far as possible) I get the following:
>
> Exception in runModel:
>
> Arguments: [--solver, bicg, --precond, diag, --matrix, nasa1824, --num_runs, 100, --modelType, contModel, --faultModel, n, --locModel, local, --ap, 1e-2, --am, 1e-4, --sp, 1e-2, --sm, 1e-4, --dp, 1e-2, --dm, 1e-4, --mp, 1e-2, --mm, 1e-4, --psp, 1e-2, --psm, 1e-2, --ptsp, 1e-2, --ptsm, 1e-2, --cprob, 1e-5, --exec_time, 5.591000e-03, --stats, modelBlocks/stats.solver_bicg.precond_diag.mtx_nasa1824.mt_contModel.fm_n.lm_local.ap_1e-2.am_1e-4.sp_1e-2.sm_1e-4.dp_1e-2.dm_1e-4.mp_1e-2.mm_1e-4.psp_1e-2.psm_1e-2.ptsp_1e-2.ptsm_1e-2.cprob_1e-5.block_26]
>
> Host: pbatch
>
> Directory: experiments.new-20140522-0841-bhu0vze3/jobs/v/runModel-vvuy90rl
>
> Caused by: Block task failed: 0522-4108580-000002 Block task ended prematurely
>
>
>
> I’ve attached the log.
>
>
>
> The second place, if the previous one fails, is ~/.globus/coasters/*.log.
>
> ~/.globus/coasters contains the following files. No logs in my install.
>
> cscript1601720472000314596.pl cscript7039282452425599503.pl cscript8162919165195912014.pl
>
> cscript3466700121560325070.pl cscript747960757439884021.pl cscript8876053012113700888.pl
>
> cscript6877638344534390867.pl cscript7841839259853776419.pl cscript95537409038396166.pl
>
>
>
> There is yet another place that isn't enabled by default. That's the coaster worker logs. It can be enabled by saying <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile> in sites.xml. It will produce some additional logs in ~/.globus/coasters/.
>
> Done and attached. Please let me know if you see anything.
>
>
>
> Greg Bronevetsky
>
> Lawrence Livermore National Lab
>
> (925) 424-5756
>
> bronevetsky at llnl.gov
>
> http://greg.bronevetsky.com
>
>
>
>
>
> -----Original Message-----
> From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
> Sent: Thursday, May 22, 2014 12:23 AM
> To: Bronevetsky, Greg
> Cc: swift-user at ci.uchicago.edu
> Subject: Re: [Swift-user] Data transfer error
>
>
>
> There are a number of places. I think this got improved a bit post 0.94, but that's another story.
>
>
>
> Anyway, first place is the swift log (<scriptName>-<runId>.log in the directory where you ran swift).
>
>
>
> The second place, if the previous one fails, is ~/.globus/coasters/*.log.
>
>
>
> There is yet another place that isn't enabled by default. That's the coaster worker logs. It can be enabled by saying <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile> in sites.xml. It will produce some additional logs in ~/.globus/coasters/.
>
>
>
> Please feel free to send any/all these our way. We might be able to quickly spot some obvious problems.
>
>
>
> Mihael
>
>
>
> On Wed, 2014-05-21 at 23:59 +0000, Bronevetsky, Greg wrote:
>
> > Where should I look to debug the following error?
>
> > Caused by: Block task failed: 0521-5404270-000009 Block task ended
>
> > prematurely
>
> >
>
> > Greg Bronevetsky
>
> > Lawrence Livermore National Lab
>
> > (925) 424-5756
>
> > bronevetsky at llnl.gov
>
> > http://greg.bronevetsky.com
>
> >
>
> >
>
> > -----Original Message-----
>
> > From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
>
> > Sent: Wednesday, May 21, 2014 2:10 PM
>
> > To: Bronevetsky, Greg
>
> > Cc: swift-user at ci.uchicago.edu
>
> > Subject: Re: [Swift-user] Data transfer error
>
> >
>
> > Hi,
>
> >
>
> > Sorry for the late reply (to your previous mail mentioning this).
>
> >
>
> > I don't know what the answer to your question is. It shouldn't be happening.
>
> >
>
> > However, a directory called <scriptName>-<timestamp>-<runid>.d should be created by swift. That directory should contain one or more *.info files, which may contain a few more details.
>
> >
>
> > Mihael
>
> >
>
> > On Wed, 2014-05-21 at 20:50 +0000, Bronevetsky, Greg wrote:
>
> > > Related question: what causes the following error?
>
> > > Caused by: Failed to move output file solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/block_10 to shared directory
>
> > > I see the file in the swift work directory and the path solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/ exists in the directory where I run the script.
>
> > >
>
> > > Greg Bronevetsky
>
> > > Lawrence Livermore National Lab
>
> > > (925) 424-5756
>
> > > bronevetsky at llnl.gov
>
> > > http://greg.bronevetsky.com
>
> > >
>
> > > From: Bronevetsky, Greg
>
> > > Sent: Tuesday, May 20, 2014 2:11 PM
>
> > > To: swift-user at ci.uchicago.edu
>
> > > Subject: Data transfer error
>
> > >
>
> > > I sometimes get the following error in my Swift runs:
>
> > > Caused by: Failed to move output file solver_bicg.precond_diag.mtx_nasa1824/mt_fmodel/fm_n/lm_local/allStats to shared directory
>
> > > What causes it and how can I avoid it?
>
> > >
>
> > > Greg Bronevetsky
>
> > > Lawrence Livermore National Lab
>
> > > (925) 424-5756
>
> > > bronevetsky at llnl.gov
>
> > > http://greg.bronevetsky.com
>
> > >
>
> > > _______________________________________________
>
> > > Swift-user mailing list
>
> > > Swift-user at ci.uchicago.edu
>
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>
> >
>
> >
>
>
>
>