[Swift-user] Re: Need help debugging strange problem...
Martin Feller
feller at mcs.anl.gov
Thu Aug 7 11:32:00 CDT 2008
Oh, I see I made an error here:
please replace "-b -e" with "-b -o" in the globusrun-ws options.
Martin
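
For convenience, here is a minimal sketch of the corrected batch submission and the status query as two small shell helpers. It uses only the globusrun-ws options that appear in this thread (with -o in place of -e). The function names, the GLOBUSRUN override variable, and the poll count/interval arguments are hypothetical additions for illustration, not part of globusrun-ws. The status output is logged as-is with a timestamp, without assuming any particular format.

```shell
#!/bin/sh
# Sketch only: wraps the commands from this thread; nothing runs until a
# function is called.

# Factory URL as given in the thread.
FACTORY="https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"

# Hypothetical override hook: set GLOBUSRUN="echo CMD" for a dry run.
GLOBUSRUN="${GLOBUSRUN:-globusrun-ws}"

# Submit /bin/hostname in batch mode and save the job EPR to a file.
#   $1: factory type (Fork or PBS)
#   $2: file to write the job EPR to
submit_batch_job() {
    $GLOBUSRUN -submit \
        -F "$FACTORY" \
        -Ft "$1" \
        -b -o "$2" \
        -c /bin/hostname
}

# Query the job status repeatedly and print each raw result with a timestamp.
#   $1: EPR file written by submit_batch_job
#   $2: number of polls
#   $3: seconds to wait between polls
poll_job_status() {
    i=0
    while [ "$i" -lt "$2" ]; do
        printf '%s ' "$(date '+%H:%M:%S')"
        $GLOBUSRUN -status -j "$1"
        i=$((i + 1))
        if [ "$i" -lt "$2" ]; then
            sleep "$3"
        fi
    done
}
```

After sourcing the file, something like "submit_batch_job PBS pbsJob.epr" followed by "poll_job_status pbsJob.epr 10 30" would submit the PBS job and then check its state ten times, 30 seconds apart. The same two calls with Fork and forkJob.epr cover the Fork case.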
Martin Feller wrote:
> Andriy:
>
> Can you please try the following:
>
> Submit a dummy job in batch mode to both Fork and PBS, and query for the
> job status instead of relying on notifications:
>
> globusrun-ws -submit \
>   -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
>   -Ft Fork \
>   -b -e forkJob.epr \
>   -c /bin/hostname
>
> Then try
>
> globusrun-ws -status -j forkJob.epr
>
> and see whether the state of your job changes after a while.
>
> Same for PBS:
>
> globusrun-ws -submit \
>   -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
>   -Ft PBS \
>   -b -e pbsJob.epr \
>   -c /bin/hostname
>
> globusrun-ws -status -j pbsJob.epr
>
> (Later on, remove those jobs by calling
> globusrun-ws -kill -j pbsJob.epr
> globusrun-ws -kill -j forkJob.epr)
>
> If you see job state changes that were not reported when using globusrun-ws
> in interactive mode, then it's a notification problem. But I don't think
> this is the case.
> I suspect the problem is that Gram4 does not get informed about job state
> changes by the scheduler event generator (SEG).
> We once had the problem that job state changes just didn't show up in the
> SEG logs, due to SEG <--> filesystem issues (I think it was Lustre).
>
> Before speculating about this: please run the batch jobs and tell me what
> you get.
>
> Martin
>
>
>
>>> From: Ben Clifford <benc at hawaga.org.uk>
>>> Date: August 7, 2008 10:27:13 AM CDT
>>> To: Andriy Fedorov <fedorov at cs.wm.edu>
>>> Cc: swift-user at ci.uchicago.edu
>>> Subject: Re: [Swift-user] Need help debugging strange problem...
>>>
>>> There is a somewhat common misconfiguration of gram4 on the server side
>>> where it is wired into the local queueing system incorrectly, so that
>>> completion notifications do not find their way back. This matches the
>>> symptoms you describe: fork works but pbs doesn't, even though the
>>> job appears to have run.
>>>
>>> I just tried a submission using the GT4 command line job submission
>>> command:
>>>
>>> $ globusrun-ws -submit \
>>>     -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
>>>     -Ft Fork -job-command /bin/hostname
>>> Submitting job...
>>>
>>>
>>>
>>> but it appears to hang without submitting. I'm not sure what is
>>> happening with that site...
>>>
>>> Aside from that, my advice for diagnosis would be to try the above
>>> command with both Fork and PBS and see if you get the same difference
>>> in behaviour between the two.
>>>
>>> --
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>
>