[Swift-user] Re: Need help debugging strange problem...

Andriy Fedorov fedorov at cs.wm.edu
Thu Aug 7 11:39:30 CDT 2008


Martin,

I tried what you suggested. The status of the job remains
"Unsubmitted" on the submission site, while I see the job completes on
NCSA Mercury.

I reported this problem to TG help, and will post an update if I hear
any explanation from them.
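
For reference, the status check quoted below can be repeated automatically instead of by hand. This is only a minimal sketch: the function name poll_job_state, the GLOBUSRUN_WS override variable, and the default attempt count and interval are assumptions made for illustration. The only client call used is the `globusrun-ws -status -j <epr file>` form from Martin's message, and no particular output format is assumed; the script prints a line whenever the reported text changes.

```shell
#!/bin/sh
# Sketch: poll a batch-submitted GRAM4 job and print its status whenever it changes.
# GLOBUSRUN_WS is a hypothetical override so the client command can be swapped out.
GLOBUSRUN_WS="${GLOBUSRUN_WS:-globusrun-ws}"

poll_job_state() {
    epr_file="$1"          # EPR file written at submission, e.g. pbsJob.epr
    attempts="${2:-10}"    # how many status queries to make (assumed default)
    interval="${3:-30}"    # seconds between queries (assumed default)
    last=""
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Capture whatever the status query prints, including error text.
        current=$("$GLOBUSRUN_WS" -status -j "$epr_file" 2>&1)
        if [ "$current" != "$last" ]; then
            echo "$(date '+%H:%M:%S') $current"
            last="$current"
        fi
        i=$((i + 1))
        if [ "$i" -lt "$attempts" ]; then
            sleep "$interval"
        fi
    done
}
```

Calling `poll_job_state pbsJob.epr 20 30` would query the PBS job twenty times at 30-second intervals. If only one line is ever printed, the reported state never changed during that window.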

Andrey


> Date: Thu, 07 Aug 2008 11:29:09 -0500
> From: Martin Feller <feller at mcs.anl.gov>
> Subject: [Swift-user] Re: Need help debugging strange problem...
> To: swift-user at ci.uchicago.edu
> Message-ID: <489B22D5.6010500 at mcs.anl.gov>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Andriy:
>
> Can you please try the following:
>
> Submit a dummy job in batch mode to Fork and PBS and query for job status
> instead of relying on notifications:
>
> globusrun-ws -submit \
>   -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
>   -Ft Fork \
>   -b -e forkJob.epr \
>   -c /bin/hostname
>
> then try
>
> globusrun-ws -status -j forkJob.epr
>
> and see whether the state of your job changes after a while.
>
> Same for PBS:
>
> globusrun-ws -submit \
>   -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
>   -Ft PBS \
>   -b -e pbsJob.epr \
>   -c /bin/hostname
>
> globusrun-ws -status -j pbsJob.epr
>
> (Later on, remove those jobs by calling
>  globusrun-ws -kill -j pbsJob.epr
>  globusrun-ws -kill -j forkJob.epr
> )
>
> If you see job state changes that were not reported when using globusrun-ws in
> interactive mode, then it's a notification problem. But I don't think this is
> the case.
> I suspect the problem is that GRAM4 does not get informed about job state changes
> by the scheduler event generator (SEG).
> We once had the problem that job state changes just didn't show up in the
> SEG logs, due to SEG <--> filesystem issues (I think it was Lustre).
>
> Before speculating about this: please run the batch jobs and tell us what you get.
>
> Martin
>
>
>
>>> From: Ben Clifford <benc at hawaga.org.uk>
>>> Date: August 7, 2008 10:27:13 AM CDT
>>> To: Andriy Fedorov <fedorov at cs.wm.edu>
>>> Cc: swift-user at ci.uchicago.edu
>>> Subject: Re: [Swift-user] Need help debugging strange problem...
>>>
>>> There is a somewhat common misconfiguration of GRAM4 on the server side
>>> where it is wired into the local queueing system incorrectly, so that
>>> completion notifications do not find their way back. This matches the
>>> symptoms you describe: Fork works but PBS doesn't, even though the
>>> job appears to have run.
>>>
>>> I just tried a submission using the GT4 command line job submission
>>> command:
>>>
>>> $ globusrun-ws -submit \
>>>     -F https://grid-hg.ncsa.teragrid.org:8443/wsrf/services/ManagedJobFactoryService \
>>>     -Ft Fork -job-command /bin/hostname
>>> Submitting job...
>>>
>>>
>>>
>>> but it appears to hang without submitting. Not sure what is happening
>>> with that site...
>>>
>>> Aside from that, my advice for diagnosis would be to try the above
>>> command with both Fork and PBS and see if you get the same difference
>>> in behaviour between the two.
>>>
>>> --
>>> _______________________________________________
>>> Swift-user mailing list
>>> Swift-user at ci.uchicago.edu
>>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-user
>>
>


