[Swift-user] Re: Need help debugging strange problem...

Martin Feller feller at mcs.anl.gov
Thu Aug 7 11:29:09 CDT 2008


Andriy:

Can you please try the following:

Submit a dummy job in batch mode to both Fork and PBS, then query for the job
status instead of relying on notifications:

globusrun-ws -submit \
   -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
   -Ft Fork \
   -b -e forkJob.epr \
   -c /bin/hostname

then try

globusrun-ws -status -j forkJob.epr

and check whether the state of your job changes after a while.

Same for PBS:

globusrun-ws -submit \
   -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
   -Ft PBS \
   -b -e pbsJob.epr \
   -c /bin/hostname

globusrun-ws -status -j pbsJob.epr

(
  Later on, remove those jobs by calling
  globusrun-ws -kill -j pbsJob.epr
  globusrun-ws -kill -j forkJob.epr
)
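
A minimal sketch of how the status query could be scripted so that each state
change is logged with a timestamp. The function name poll_job_state and the
STATUS_CMD variable are hypothetical helpers, not part of globusrun-ws; only
the "-status -j" call is taken from the commands above. The sketch prints
whatever the status command reports, since the exact wording of that output is
not assumed here.

```shell
#!/bin/sh
# STATUS_CMD is a hypothetical override so the loop can be tried without a grid.
STATUS_CMD=${STATUS_CMD:-globusrun-ws}

# Query a batch job repeatedly and print a line only when the report changes.
poll_job_state() {
    epr=$1          # EPR file written at submission, e.g. pbsJob.epr
    tries=${2:-20}  # number of status queries
    delay=${3:-30}  # seconds to wait between queries
    last=""
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Capture the full report (stdout and stderr) of the status query.
        state=$("$STATUS_CMD" -status -j "$epr" 2>&1)
        if [ "$state" != "$last" ]; then
            echo "$(date '+%H:%M:%S') $epr: $state"
            last=$state
        fi
        i=$((i + 1))
        # Skip the wait after the final query.
        if [ "$i" -lt "$tries" ]; then
            sleep "$delay"
        fi
    done
}
```

For example, "poll_job_state pbsJob.epr 20 30" would query the PBS job 20
times, 30 seconds apart, and print one line per observed change. If the output
never changes for PBS but does for Fork, that points away from notifications.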

If you see job state changes that were not reported when using globusrun-ws in
interactive mode, then it's a notification problem. But I don't think this is
the case.
I suspect the problem is that GRAM4 is not being informed about job state changes
by the scheduler event generator (SEG).
We once had a problem where job state changes simply didn't show up in the
SEG logs because of SEG <--> filesystem issues (I think it was Lustre).

Before speculating about this: please run the batch jobs and tell us what you get.

Martin



>> From: Ben Clifford <benc at hawaga.org.uk>
>> Date: August 7, 2008 10:27:13 AM CDT
>> To: Andriy Fedorov <fedorov at cs.wm.edu>
>> Cc: swift-user at ci.uchicago.edu
>> Subject: Re: [Swift-user] Need help debugging strange problem...
>>
>> there is a somewhat common misconfiguration of gram4 on the server side
>> where it is wired into the local queueing system incorrectly so that
>> completion notifications do not find their way back. this matches the
>> symptoms you describe - that fork works but that pbs doesn't, but that the
>> job appears to have run.
>>
>> I just tried a submission using the GT4 command line job submission
>> command:
>>
>> $ globusrun-ws -submit \
>>     -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
>>     -Ft Fork -job-command /bin/hostname
>> Submitting job...
>>
>>
>>
>> but it appears to hang without submitting. not sure what is happening with
>> that site...
>>
>> Aside from that, my advice for diagnosis would be to try the above command
>> with both Fork and PBS and see if you get the same difference in behaviour
>> between the two.