[Swift-user] cannot submit on fusion

wilde at mcs.anl.gov wilde at mcs.anl.gov
Sat Mar 13 12:52:07 CST 2010


----- "Marcin Hitczenko" <marcin at galton.uchicago.edu> wrote:

> Hi Mike,
> 
> So, I think I know what it might be. It seems that the two projects I
> am
> listed under do not have any cpu core time available, which would
> explain
> why qsub is not working. So, I guess the problem has nothing to do
> with
> swift.

Cool. Can you get more hours added? Do you want to try running some jobs at CI? Perhaps we can do this transparently by extending the sites.xml below.

> I actually have one other question. When I do finally submit jobs,
> will
> swift make use of the fact that I have 8 processors on each node? Is
> there
> anything I need to add to sites.xml or tc.data so that I do not waste
> any
> available processors, as I only need one per job?

Good point. This depends a lot on how PBS is treating your job specs, about which, I must admit, I still have some unanswered questions.

*If* PBS is treating a request for 1 node as a request for 1 core, it will assign multiple 1-core job requests to the same 8-core node. qstat -n and qstat -f will help us determine what's happening; so will reading the Fusion online documentation and asking a question on its support list.

We can control what happens by using Swift "coasters" with the sites.xml entry below, which lets us craft more precisely how your jobs are mapped to PBS:

  <pool handle="pbs">
    <execution provider="coaster" url="none" jobmanager="local:pbs"/>

    <profile namespace="globus" key="workersPerNode">8</profile>
    <profile namespace="globus" key="maxTime">3500</profile>         <!-- ADJUST -->
    <profile namespace="globus" key="maxWallTime">00:00:30</profile> <!-- ADJUST -->

    <!-- Run up to 8 PBS jobs of 4 nodes each (up to 256 cores total): -->
    <profile namespace="globus" key="slots">8</profile>
    <profile namespace="globus" key="nodeGranularity">4</profile>
    <profile namespace="globus" key="maxNodes">4</profile>

    <!-- run up to 256 app() tasks at once: (2.55*100)+1 -->
    <profile namespace="karajan" key="jobThrottle">2.55</profile>   
    <profile namespace="karajan" key="initialScore">10000</profile>

    <filesystem provider="local" url="none"/>
    <workdirectory>/home/wilde/swiftwork</workdirectory>            <!-- ADJUST -->
  </pool>
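
As a sanity check on the numbers in the comments above, here is a minimal Python sketch of the arithmetic. The function names are hypothetical (they are not part of Swift); the values are copied from the profile entries, and the throttle rule is the "(jobThrottle*100)+1" formula from the comment:

```python
# Sketch only: these helper names are made up for illustration, not Swift APIs.

def total_cores(slots, max_nodes, workers_per_node):
    """Cores in use when every PBS job (coaster block) runs at full size."""
    return slots * max_nodes * workers_per_node

def concurrent_tasks(job_throttle):
    """Concurrent app() tasks allowed, per the (jobThrottle * 100) + 1 rule."""
    # round() guards against floating-point error in job_throttle * 100
    return round(job_throttle * 100) + 1

def throttle_for(tasks):
    """jobThrottle value that allows the given number of concurrent tasks."""
    return (tasks - 1) / 100

cores = total_cores(slots=8, max_nodes=4, workers_per_node=8)
print(cores)                   # 256
print(concurrent_tasks(2.55))  # 256
print(throttle_for(cores))     # 2.55
```

If you change slots, maxNodes, or workersPerNode, recompute jobThrottle the same way so that Swift keeps every core busy without queuing more tasks than there are workers.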

Do you want to try to replicate your data to PADS to try something similar there?

Also, do you perchance have any Grid certificates (TeraGrid, OSG/DOEGrids)? If so, you can leave your data in place and run on PADS and/or TeraPort. If not, I'll see if we can configure coasters to use a dummy certificate to enable you to run coasters over ssh on remote machines.

- Mike

> Thanks again for the help. Please let me know if you would like more
> detail about my swift use. I am happy to help and give feedback.
> 
> Marcin
> 
> > Marcin, I took the liberty of poking around your work dir on
> fusion.
> >
> > The problem seems to be that qsub is rejecting the job that swift is
> > submitting to it:
> >
> > Caused by: Cannot submit job: Could not submit job (qsub reported an
> exit
> > code of 1). no error output
> >
> > Now we need to find out why that is.
> >
> > I see that your tc.data file does not end in a newline. Let's try to
> get
> > rid of the message about the ":" to eliminate that as a
> possibility.
> >
> > Can you also do these things:
> >
> > - do a quick qsub test of a "echo hi" script to ensure that your
> Fusion
> > PBS project is still valid, and the qsub is working for you. In this
> test,
> > set a max wall time the same as what you're trying to set via tc.data
> (but
> > which I think is being ignored because Swift is unable to parse the
> GLOBUS
> > namespace declaration from that line)
> >
> > - see if there are recent files under $HOME/.globus/scripts or
> other
> > directories under .globus (which I cannot access) which may contain
> a clue
> > as to why PBS is rejecting the job.
> >
> > - Mike
> >
> > ----- "Marcin Hitczenko" <marcin at galton.uchicago.edu> wrote:
> >
> >> Hi Mike,
> >>
> >> When I look at tc.data it seems to be fine (I made sure and ran
> again
> >> and
> >> got the same error). I also have not changed tc.data since I ran
> it
> >> last
> >> and I seem to remember getting the same error about the illegal
> >> character
> >> before.
> >>
> >> I am quite sure I haven't changed anything since I ran last, so I
> am
> >> wondering if it might be some changes in fusion which I need to
> >> update
> >> for?
> >>
> > ...
> >
> >> >
> >> > I see two problems, the second likely being the result of the
> >> first:
> >> >
> >> > In your output file: [ERROR] Parsing profiles on line 21 Illegal
> >> character
> >> > ':'at position 22 :Illegal character ':'
> >> > is referring to your tc.data file. I think your Globus
> MaxWallTime
> >> profile
> >> > entry got moved to a separate line, instead of being separated
> by
> >> tabs as
> >> > the last column of the previous line.
> >> >
> >> > I suspect that may have caused jobs to get submitted to PBS with
> >> defaults
> >> > that were invalid for the default queue that your jobs are going
> >> into,
> >> > thus causing the second error: Cannot submit job: Could not
> submit
> >> job
> >> > (qsub reported an exit code of 1). no error output
> >> >
> >> > So fix tc.data, and see if this fixes the problem.
> >> >
> >> > - Mike
> >> >
> >> > ----- "Marcin Hitczenko" <marcin at galton.uchicago.edu> wrote:
> >> >
> >> >> Hi Mike,
> >> >>
> >> >> Thanks for your response. I am attaching the .log file. Also,
> the
> >> >> swift.out file I included in the original email has the output
> of
> >> my
> >> >> run.
> >> >> I am including it again.
> >> >>
> >> >> Best,
> >> >>
> >> >> Marcin
> >> >>
> >> >> > Marcin, I forgot to point out: "failed to transfer wrapper
> log"
> >> is
> >> >> just a
> >> >> > catch-all error message which means "something went wrong
> with
> >> an
> >> >> app()
> >> >> > job that Swift ran, and the job did not return the expected
> log
> >> file
> >> >> that
> >> >> > comes from the wrapper script under which Swift runs the job."
> >> We
> >> >> need to
> >> >> > improve the text of this message.
> >> >> >
> >> >> > Also, if you can, always run swift with standard output and
> >> error
> >> >> > redirected into a file, and send that file as well when you
> >> report
> >> >> a
> >> >> > problem.
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Mike
> >> >> >
> >> >> > ----- "Michael Wilde" <wilde at mcs.anl.gov> wrote:
> >> >> >
> >> >> >> Marcin, can you also post the .log file from this run? (it
> will
> >> be
> >> >> >> named getcoefs*.log where * is a long unique id including
> the
> >> >> date)
> >> >> >>
> >> >> >> - Mike
> >> >> >>
> >> >> >> ----- "Marcin Hitczenko" <marcin at galton.uchicago.edu> wrote:
> >> >> >>
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > I am trying to run swift scripts on fusion and I am
> >> encountering
> >> >> an
> >> >> >> > error
> >> >> >> > I have never had before (The scripts I am running have
> worked
> >> >> >> before).
> >> >> >> > It
> >> >> >> > seems it is having trouble submitting the job because it
> "failed
> >> to
> >> >> >> > transfer
> >> >> >> > wrapper log". I am including my swift script, tc.data,
> >> sites.xml
> >> >> >> and
> >> >> >> > the
> >> >> >> > output file of when I ran it.
> >> >> >> >
> >> >> >> > I am not sure if I need to change anything? Like I said, I
> am
> >> >> sure
> >> >> >> > the
> >> >> >> > same script worked a month or so ago.
> >> >> >> >
> >> >> >> > Thanks for your help.
> >> >> >> >
> >> >> >> > Best,
> >> >> >> >
> >> >> >> > Marcin
> >> >> >> >
> >> >> >> > _______________________________________________
> >> >> >> > Swift-user mailing list
> >> >> >> > Swift-user at ci.uchicago.edu
> >> >> >> > http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
> >> >> >>
> >> >> >> --
> >> >> >> Michael Wilde
> >> >> >> Computation Institute, University of Chicago
> >> >> >> Mathematics and Computer Science Division
> >> >> >> Argonne National Laboratory
> >> >> >>
> >> >> >
> >> >> > --
> >> >> > Michael Wilde
> >> >> > Computation Institute, University of Chicago
> >> >> > Mathematics and Computer Science Division
> >> >> > Argonne National Laboratory
> >> >> >
> >> >
> >> > --
> >> > Michael Wilde
> >> > Computation Institute, University of Chicago
> >> > Mathematics and Computer Science Division
> >> > Argonne National Laboratory
> >> >
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



