[Swift-devel] wrong file staged in

Veronika Nefedova nefedova at mcs.anl.gov
Fri Jul 6 16:53:58 CDT 2007


I didn't try another run. Something was really weird during that run.  
Some jobs just failed because the executable failed:
stderr.txt:
forrtl: No such file or directory
/home/ydeng/c34a2/exec/ia64/charmm: relocation error: /soft/intel-c-9.1.049-f-9.1.045/lib/libunwind.so.6: undefined symbol: ?1__serial_memmove

But the jobs with wrong files staged in were running (the same  
executable)...
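
Since there are 130+ job directories, checking them by hand is slow. A minimal Python sketch of an automated check, assuming the expected input names for each job are known in advance: it reports each expected input that is missing and lists files of the same type that are present instead. The function name `find_misstaged` is hypothetical, the expected list holds only the one input named in this thread, and the listing is the `ls` output quoted further down.

```python
from pathlib import PurePosixPath


def find_misstaged(expected_inputs, present_files):
    """Map each missing expected input to same-extension files found instead.

    Hypothetical helper: the caller supplies the expected names and the
    directory listing; nothing here talks to Swift or the remote host.
    """
    expected = set(expected_inputs)
    present = set(present_files)
    report = {}
    for name in expected_inputs:
        if name in present:
            continue  # this input was staged in correctly
        suffix = PurePosixPath(name).suffix
        # Unexpected files with the same extension are likely stand-ins.
        report[name] = sorted(
            f for f in present
            if f not in expected and PurePosixPath(f).suffix == suffix
        )
    return report


# Directory listing of chrm_long-p2v28ydi as quoted in this thread.
listing = [
    "m001_am1.prm", "solv.inp", "solv_m001_eq.crd", "stderr.txt",
    "m001_am1.rtf", "solv_disp_a3.out", "solv_repu_0.9_1_b1.rst",
    "parm03_gaff_all.rtf", "solv_disp_a3.prt", "solv_repu_0.9_1_b1.trj",
    "parm03_gaffnb_all.prm", "solv_m001.psf", "solv_repu_0.9_1_b1.wham",
]
# Assumption: only the one input file named in the thread is checked here.
expected = ["solv_repu_0.9_1_b1.prt"]

print(find_misstaged(expected, listing))
```

For that listing the report names solv_disp_a3.prt as the file present in place of solv_repu_0.9_1_b1.prt. Run over every job directory, the same check would show whether the substitution is the same everywhere.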

I can repeat the run again now.
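
Before repeating the run, the single-variable grep check quoted below could be widened to every declaration in the file. A minimal Python sketch, assuming declarations always use the simple `file NAME <"path">;` form shown in this thread (other mapper syntax is not handled). `conflicting_declarations` is a hypothetical helper, and the second prt declaration in the sample is invented purely to show what a hit would look like.

```python
import re
from collections import defaultdict

# Matches declarations of the form: file NAME <"path">;
DECL = re.compile(r'\bfile\s+(\w+)\s*<\s*"([^"]+)"\s*>\s*;')


def conflicting_declarations(source):
    """Return {variable: [paths]} for variables declared more than once."""
    mapped = defaultdict(list)
    for name, path in DECL.findall(source):
        mapped[name].append(path)
    return {name: paths for name, paths in mapped.items() if len(paths) > 1}


# The first two lines come from the thread; the third is a made-up duplicate.
sample = '''
file solv_repu_0DOT9_1_b1_prt <"solv_repu_0.9_1_b1.prt">;
file solv_repu_0DOT9_1_b1_crd  <"solv_repu_0.9_1_b1.crd">;
file solv_repu_0DOT9_1_b1_prt <"solv_disp_a3.prt">;
'''

print(conflicting_declarations(sample))
```

An empty result for the real MolDyn.dtm would agree with the grep output and rule out a duplicate declaration as the cause.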

Nika

On Jul 6, 2007, at 4:49 PM, Mihael Hategan wrote:

> On Fri, 2007-07-06 at 16:44 -0500, Veronika Nefedova wrote:
>> I put the dtm file on terminable in ~nefedova/MolDyn.dtm
>>
>> I see a few more directories with wrong files staged in, but I didn't
>> check them all (130+ of them). I saw at least one with the correct
>> files staged in.
>
> Across different runs, that is. Do you get the exact same mess-up, or
> is it different?
>
>>
>> Nika
>>
>> On Jul 6, 2007, at 4:39 PM, Mihael Hategan wrote:
>>
>>> Consistent or intermittent behavior?
>>>
>>> Also, can you attach the swift source?
>>>
>>> On Fri, 2007-07-06 at 16:37 -0500, Veronika Nefedova wrote:
>>>> Nope... I checked with grep:
>>>>
>>>> nefedova at viper:~/alamines> grep solv_repu_0DOT9_1_b1_prt MolDyn.dtm
>>>> file solv_repu_0DOT9_1_b1_prt <"solv_repu_0.9_1_b1.prt">;
>>>> (whamfiles[67], solv_repu_0DOT9_1_b1_crd, solv_repu_0DOT9_1_b1_out,
>>>> solv_repu_0DOT9_1_b1_done) = CHARMM3 (standn, gaff_prm, gaff_rft,
>>>> rtf_file, prm_file, psf_file, crd_eq_file, solv_repu_0DOT9_1_b1_prt,
>>>> ss1, s1, s2, s3, s4, s5, s7, "urandseed:5964163", sprt, "rcut1:0.9",
>>>> "rcut2:1");
>>>> nefedova at viper:~/alamines>
>>>>
>>>> On Jul 6, 2007, at 4:31 PM, Mihael Hategan wrote:
>>>>
>>>>> I wonder whether there is another declaration of the same variable
>>>>> mapped to the wrong file.
>>>>>
>>>>> On Fri, 2007-07-06 at 16:03 -0500, Veronika Nefedova wrote:
>>>>>> The wrong file was staged in during the 4th stage of the workflow...
>>>>>>
>>>>>> I have this inside my foreach loop:
>>>>>> <snip>
>>>>>> file solv_repu_0DOT9_1_b1_prt <"solv_repu_0.9_1_b1.prt">;
>>>>>> file solv_repu_0DOT9_1_b1_crd  <"solv_repu_0.9_1_b1.crd">;
>>>>>> file solv_repu_0DOT9_1_b1_out <"solv_repu_0.9_1_b1.out">;
>>>>>> file solv_repu_0DOT9_1_b1_done  <"solv_repu_0.9_1_b1_done">;
>>>>>>
>>>>>> (whamfiles[67], solv_repu_0DOT9_1_b1_crd, solv_repu_0DOT9_1_b1_out,
>>>>>> solv_repu_0DOT9_1_b1_done) = CHARMM3 (standn, gaff_prm, gaff_rft,
>>>>>> rtf_file, prm_file, psf_file, crd_eq_file, solv_repu_0DOT9_1_b1_prt,
>>>>>> ss1, s1, s2, s3, s4, s5, s7, "urandseed:5964163", sprt, "rcut1:0.9",
>>>>>> "rcut2:1");
>>>>>> <snip>
>>>>>>
>>>>>>
>>>>>> The first declared file (prt) is an input file for CHARMM3, and the
>>>>>> last three declared files (crd, out and done) are output files.
>>>>>>
>>>>>> When I check my remote directory during execution, I see that the
>>>>>> wrong files were staged in. In particular, the wrong prt file was
>>>>>> staged in:
>>>>>>
>>>>>> solv_disp_a3.prt instead of solv_repu_0.9_1_b1.prt  (aka
>>>>>> solv_repu_0DOT9_1_b1_prt)
>>>>>>
>>>>>> The solv_repu_0.9_1_b1.prt file is not produced by a previous stage;
>>>>>> it is (or is supposed to be) staged in from the submit host.
>>>>>>
>>>>>> The above declaration is the only place where the file
>>>>>> solv_repu_0DOT9_1_b1_prt is declared in the swift file (I did a grep
>>>>>> to check it). The kml file also looks ok.
>>>>>>
>>>>>> I am not sure why this happened -- this piece of code has not been
>>>>>> changed from the previous version...
>>>>>>
>>>>>>
>>>>>> This is the work directory for this job (CHARMM3) on TG-UC:
>>>>>>
>>>>>> nefedova at tg-login1:/disks/scratchgpfs1/iraicu/MolDyn-zvlc1f9c03pf0/chrm_long-p2v28ydi> ls
>>>>>> m001_am1.prm           solv.inp          solv_m001_eq.crd         stderr.txt
>>>>>> m001_am1.rtf           solv_disp_a3.out  solv_repu_0.9_1_b1.rst
>>>>>> parm03_gaff_all.rtf    solv_disp_a3.prt  solv_repu_0.9_1_b1.trj
>>>>>> parm03_gaffnb_all.prm  solv_m001.psf     solv_repu_0.9_1_b1.wham
>>>>>> nefedova at tg-login1:/disks/scratchgpfs1/iraicu/MolDyn-zvlc1f9c03pf0/chrm_long-p2v28ydi>
>>>>>>
>>>>>> As you can see, 2 files have the wrong names (solv_disp_a3 instead of
>>>>>> solv_repu_0.9_1_b1) and execution is screwed up since the wrong
>>>>>> parameter file (prt) was staged in...
>>>>>>
>>>>>>
>>>>>> I checked whether that file was even staged in to the remote host --
>>>>>> in fact it was:
>>>>>>
>>>>>> nefedova at tg-login1:/disks/scratchgpfs1/iraicu/MolDyn-zvlc1f9c03pf0> find */ -name solv_repu_0.9_1_b1.prt -print
>>>>>> shared/solv_repu_0.9_1_b1.prt
>>>>>> But it never went to the right working directory...
>>>>>>
>>>>>> Any idea what is going on here?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Nika
>>>>>>
>>>>>> _______________________________________________
>>>>>> Swift-devel mailing list
>>>>>> Swift-devel at ci.uchicago.edu
>>>>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>>>>>
>>>>>
>>>>
>>>
>>
>



