[Swift-user] Re: Need help debugging strange problem...

Thu Aug 7 11:32:00 CDT 2008

Oh, I see I made an error here:
please replace "-b -e" with "-b -o" in the globusrun-ws options.
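
For reference, here is a minimal sketch of the corrected test as a small shell
script. It uses only the options already mentioned in this thread, with "-o" in
place of "-e". The function names and the GLOBUSRUN override variable are
made up for illustration, and the poll count and delay are arbitrary defaults;
it assumes globusrun-ws is on the PATH and a valid proxy exists.

```shell
#!/bin/sh
# Sketch only: helper names and GLOBUSRUN are hypothetical, not part of Globus.
# GLOBUSRUN lets you substitute another command (e.g. "echo") for a dry run.
FACTORY="https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService"

# submit_batch TYPE EPRFILE
# Submit /bin/hostname in batch mode (-b) to the given factory type (Fork or PBS)
# and write the job EPR to EPRFILE (-o), as corrected above.
submit_batch() {
    "${GLOBUSRUN:-globusrun-ws}" -submit \
        -F "$FACTORY" \
        -Ft "$1" \
        -b -o "$2" \
        -c /bin/hostname
}

# poll_status EPRFILE [COUNT] [DELAY_SECONDS]
# Query the job status COUNT times (default 5), sleeping DELAY_SECONDS
# (default 10) between queries, and print whatever globusrun-ws reports.
poll_status() {
    ps_epr=$1
    ps_count=${2:-5}
    ps_delay=${3:-10}
    ps_i=0
    while [ "$ps_i" -lt "$ps_count" ]; do
        "${GLOBUSRUN:-globusrun-ws}" -status -j "$ps_epr"
        ps_i=$((ps_i + 1))
        if [ "$ps_i" -lt "$ps_count" ]; then
            sleep "$ps_delay"
        fi
    done
}
```

Usage would be something like "submit_batch Fork forkJob.epr" followed by
"poll_status forkJob.epr", then the same with PBS and pbsJob.epr. Afterwards
the jobs can be removed with "globusrun-ws -kill -j <file>.epr".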

Martin

Martin Feller wrote:
> Andriy:
> 
> Can you please try the following:
> 
> submit a dummy job in batch mode to Fork and PBS and query for job status
> instead of relying on notifications:
> 
> globusrun-ws -submit \
>   -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
>   -Ft Fork \
>   -b -e forkJob.epr \
>   -c /bin/hostname
> 
> then try
> 
> globusrun-ws -status -j forkJob.epr
> 
> and see if you see changes in state of your job after a while
> 
> Same for PBS:
> 
> globusrun-ws -submit \
>   -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
>   -Ft PBS \
>   -b -e pbsJob.epr \
>   -c /bin/hostname
> 
> globusrun-ws -status -j pbsJob.epr
> 
> (
>  later on, remove those jobs by calling
>  globusrun-ws -kill -j pbsJob.epr
>  globusrun-ws -kill -j forkJob.epr
> )
> 
> If you see job state changes that had not been reported using globusrun-ws in
> interactive mode, then it's a notification problem. But I don't think this is
> the case.
> I suspect the problem is that Gram4 does not get informed about job state
> changes by the scheduler event generator (SEG).
> We once had the problem that the job state changes just didn't show up in the
> SEG logs, due to SEG <--> filesystem issues (I think it was Lustre).
> 
> Before speculating about this: please run the batch jobs and tell me what
> you get.
> 
> Martin
> 
> 
> 
>>> From: Ben Clifford <benc at hawaga.org.uk>
>>> Date: August 7, 2008 10:27:13 AM CDT
>>> To: Andriy Fedorov <fedorov at cs.wm.edu>
>>> Cc: swift-user at ci.uchicago.edu
>>> Subject: Re: [Swift-user] Need help debugging strange problem...
>>>
>>> there is a somewhat common misconfiguration of gram4 on the server side
>>> where it is wired into the local queueing system incorrectly so that
>>> completion notifications do not find their way back. this matches the
>>> symptoms you describe - that fork works but that pbs doesn't, but that the
>>> job appears to have run.
>>>
>>> I just tried a submission using the GT4 command line job submission
>>> command:
>>>
>>> $ globusrun-ws -submit \
>>>     -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
>>>     -Ft Fork -job-command /bin/hostname
>>> Submitting job...
>>>
>>>
>>>
>>> but it appears to hang without submitting. not sure what is happening with
>>> that site...
>>>
>>> Aside from that, my advice for diagnosis would be to try the above command
>>> with both Fork and PBS and see if you get the same difference in behaviour
>>> between the two.
>>>
>>> -- 
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu <mailto:Swift-user at ci.uchicago.edu>
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>
> 