[Swift-user] Data transfer error
Bronevetsky, Greg
bronevetsky1 at llnl.gov
Thu May 22 11:11:24 CDT 2014
> There are a number of places. I think this got improved a bit post 0.94, but that's another story.
> Anyway, first place is the swift log (<scriptName>-<runId>.log in the directory where you ran swift).
I’m not seeing much here. There are the periodic warnings like:
2014-05-22 08:53:50,494-0700 INFO RuntimeStats$ProgressTicker Selecting site:39311 Stage in:1 Submitting:1 Stage out:284 Finished successfully:3118 Failed but can retry:646
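For anyone tracking these progress lines over a long run, here is a minimal Python sketch that turns one ProgressTicker line into per-state counts. The function name `parse_progress` is made up for illustration, and the format is inferred only from the single line quoted above, so other Swift versions may print it differently.

```python
import re

def parse_progress(line):
    """Return {state: count} for one ProgressTicker log line, or {} otherwise."""
    marker = "ProgressTicker"
    if marker not in line:
        return {}
    # Drop the timestamp and logger name so their colons are not mistaken for entries.
    tail = line.split(marker, 1)[1]
    # Each entry looks like "Stage out:284"; state names may contain spaces.
    pairs = re.findall(r"([A-Za-z][A-Za-z ]*?):(\d+)", tail)
    return {name.strip(): int(count) for name, count in pairs}

line = ("2014-05-22 08:53:50,494-0700 INFO RuntimeStats$ProgressTicker "
        "Selecting site:39311 Stage in:1 Submitting:1 Stage out:284 "
        "Finished successfully:3118 Failed but can retry:646")
counts = parse_progress(line)
print(counts["Failed but can retry"])
```

Applying this to every ProgressTicker line in the log would show whether the "Failed but can retry" count grows steadily or jumps at particular times.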
And occasional errors like:
Block Block task status changed: Failed Exitcode file (/g/g15/bronevet/.globus/scripts/PBS7360428055973706159.submit.exitcode) not found 5 queue polls after the job was reported done
However, I can’t see the file in question at the reported path.
Also, when the run finally fails (I ran with -lazy.errors true to keep it going as far as possible) I get the following:
Exception in runModel:
Arguments: [--solver, bicg, --precond, diag, --matrix, nasa1824, --num_runs, 100, --modelType, contModel, --faultModel, n, --locModel, local, --ap, 1e-2, --am, 1e-4, --sp, 1e-2, --sm, 1e-4, --dp, 1e-2, --dm, 1e-4, --mp, 1e-2, --mm, 1e-4, --psp, 1e-2, --psm, 1e-2, --ptsp, 1e-2, --ptsm, 1e-2, --cprob, 1e-5, --exec_time, 5.591000e-03, --stats, modelBlocks/stats.solver_bicg.precond_diag.mtx_nasa1824.mt_contModel.fm_n.lm_local.ap_1e-2.am_1e-4.sp_1e-2.sm_1e-4.dp_1e-2.dm_1e-4.mp_1e-2.mm_1e-4.psp_1e-2.psm_1e-2.ptsp_1e-2.ptsm_1e-2.cprob_1e-5.block_26]
Host: pbatch
Directory: experiments.new-20140522-0841-bhu0vze3/jobs/v/runModel-vvuy90rl
Caused by: Block task failed: 0522-4108580-000002 Block task ended prematurely
I’ve attached the log.
> The second place, if the previous one fails, is ~/.globus/coasters/*.log.
~/.globus/coasters contains the following files. No logs in my install.
cscript1601720472000314596.pl cscript7039282452425599503.pl cscript8162919165195912014.pl
cscript3466700121560325070.pl cscript747960757439884021.pl cscript8876053012113700888.pl
cscript6877638344534390867.pl cscript7841839259853776419.pl cscript95537409038396166.pl
> There is yet another place that isn't enabled by default. That's the coaster worker logs. It can be enabled by saying <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile> in sites.xml. It will produce some additional logs in ~/.globus/coasters/.
Done and attached. Please let me know if you see anything.
Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com
-----Original Message-----
From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
Sent: Thursday, May 22, 2014 12:23 AM
To: Bronevetsky, Greg
Cc: swift-user at ci.uchicago.edu
Subject: Re: [Swift-user] Data transfer error
There are a number of places. I think this got improved a bit post 0.94, but that's another story.
Anyway, first place is the swift log (<scriptName>-<runId>.log in the directory where you ran swift).
The second place, if the previous one fails, is ~/.globus/coasters/*.log.
There is yet another place that isn't enabled by default. That's the coaster worker logs. It can be enabled by saying <profile namespace="globus" key="workerLoggingLevel">DEBUG</profile> in sites.xml. It will produce some additional logs in ~/.globus/coasters/.
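The locations above can be gathered in one pass. The following is a rough Python sketch, not part of Swift: the helper name `find_logs` is hypothetical, and it simply globs for `*.log` in the run directory and in `~/.globus/coasters/`, assuming the worker logs also end in `.log` there.

```python
import glob
import os

def find_logs(run_dir=".", coasters_dir="~/.globus/coasters"):
    """Collect candidate log files from the two directories named above."""
    coasters_dir = os.path.expanduser(coasters_dir)
    return {
        # Swift run logs: <scriptName>-<runId>.log in the launch directory.
        "swift": sorted(glob.glob(os.path.join(run_dir, "*.log"))),
        # Coaster logs, plus worker logs if workerLoggingLevel is set (assumed *.log).
        "coasters": sorted(glob.glob(os.path.join(coasters_dir, "*.log"))),
    }

if __name__ == "__main__":
    for place, paths in find_logs().items():
        print(place, len(paths), "file(s)")
        for path in paths:
            print("  ", path)
```

An empty "coasters" list would mean that no log files were written to that directory.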
Please feel free to send any or all of these our way. We might be able to quickly spot some obvious problems.
Mihael
On Wed, 2014-05-21 at 23:59 +0000, Bronevetsky, Greg wrote:
> Where should I look to debug the following error?
> Caused by: Block task failed: 0521-5404270-000009 Block task ended
> prematurely
>
> Greg Bronevetsky
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky at llnl.gov
> http://greg.bronevetsky.com
>
>
> -----Original Message-----
> From: Mihael Hategan [mailto:hategan at mcs.anl.gov]
> Sent: Wednesday, May 21, 2014 2:10 PM
> To: Bronevetsky, Greg
> Cc: swift-user at ci.uchicago.edu
> Subject: Re: [Swift-user] Data transfer error
>
> Hi,
>
> Sorry for the late reply (to your previous mail mentioning this).
>
> I don't know what the answer to your question is. It shouldn't be happening.
>
> However, a directory called <scriptName>-<timestamp>-<runid>.d should be created by swift. That directory should contain one or more *.info files, which may contain a few more details.
>
> Mihael
>
> On Wed, 2014-05-21 at 20:50 +0000, Bronevetsky, Greg wrote:
> > Related question: what causes the following error?
> > Caused by: Failed to move output file
> > solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/block_10 to shared directory
> > I see the file in the swift work directory, and the path solver_bicg.precond_diag.mtx_nasa1824/FI/blocks/ exists in the directory where I run the script.
> >
> > Greg Bronevetsky
> > Lawrence Livermore National Lab
> > (925) 424-5756
> > bronevetsky at llnl.gov
> > http://greg.bronevetsky.com
> >
> > From: Bronevetsky, Greg
> > Sent: Tuesday, May 20, 2014 2:11 PM
> > To: swift-user at ci.uchicago.edu
> > Subject: Data transfer error
> >
> > I sometimes get the following error in my Swift runs:
> > Caused by: Failed to move output file
> > solver_bicg.precond_diag.mtx_nasa1824/mt_fmodel/fm_n/lm_local/allStats to shared directory
> > What causes it and how can I avoid it?
> >
> > Greg Bronevetsky
> > Lawrence Livermore National Lab
> > (925) 424-5756
> > bronevetsky at llnl.gov
> > http://greg.bronevetsky.com
> >
> > _______________________________________________
> > Swift-user mailing list
> > Swift-user at ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-user
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: worker-0522-0109210-000000.log.bz2
Type: application/octet-stream
Size: 546622 bytes
Desc: worker-0522-0109210-000000.log.bz2
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140522/013b30c2/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: experiments.new-20140522-0901-mke7zei7.log.bz2
Type: application/octet-stream
Size: 638485 bytes
Desc: experiments.new-20140522-0901-mke7zei7.log.bz2
URL: <http://lists.mcs.anl.gov/pipermail/swift-user/attachments/20140522/013b30c2/attachment-0001.obj>