[Swift-devel] Q about MolDyn

Ioan Raicu iraicu at cs.uchicago.edu
Fri Aug 3 23:03:03 CDT 2007


Hi,
Nika can probably be more specific, but the last time we ran the
244-molecule MolDyn, the workflow failed on the last few jobs, and the
failures were application-specific, not caused by Swift or Falkon.  I
believe the specific issue that caused those jobs to fail has been
resolved.

We have made another attempt at the 244-molecule MolDyn run, and from
what I can tell, it again did not complete successfully.  We were
supposed to have 20497 jobs...

   1      1        1
   1    244      244
   1    244      244
  68    244    16592
   1    244      244
  11    244     2684
   1    244      244
   1    244      244
--------------------
Total          20497


but we have:
20482 with exit code 0
1 with exit code -3
2 with exit code 253
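
As a quick sanity check on the numbers above, here is a minimal sketch
that recomputes the expected total from the table and compares it with
the exit-code tally.  Reading the first two table columns as factors
whose product is the third column is an assumption on my part, and the
variable names are purely illustrative.

```python
# Rows from the table above as (first column, second column).
# Assumption: the third column is the product of the first two.
stages = [(1, 1), (1, 244), (1, 244), (68, 244),
          (1, 244), (11, 244), (1, 244), (1, 244)]
expected_jobs = sum(a * b for a, b in stages)

# Exit-code tally reported above: exit code -> number of job executions.
exit_codes = {0: 20482, -3: 1, 253: 2}
succeeded = exit_codes[0]
failed_attempts = sum(n for code, n in exit_codes.items() if code != 0)

# Jobs that still need a successful run to complete the workflow.
remaining = expected_jobs - succeeded

print("expected:", expected_jobs)
print("failed attempts:", failed_attempts)
print("jobs to go:", remaining)
```

The difference between the expected total and the successful
executions is where the "15 jobs to go" figure comes from.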

I forgot to enable debug logging on the workers, so I don't know what
STDOUT and STDERR were for these 3 jobs.  Given that Swift retries a
job 3 times before it fails the workflow, my guess is that these 3
failures were really the same job failing 3 times.  The failures
occurred on 3 different machines, so I don't think they were machine
related.  Nika, can you tell from the various Swift logs what happened
to these 3 jobs?  Is this the same issue we had on the last 244 mol
run?  It looks like we failed the workflow with 15 jobs to go.

The graphs all look nice, similar to the last ones we had.  If people 
really want to see them, I can generate them again.  Otherwise, look at 
http://tg-viz-login1.uc.teragrid.org:51000/index.htm to see the last 10K 
samples of the experiment.

Nika, after you try to figure out what happened, can you simply retry
the workflow?  Maybe it will manage to finish the last 15 jobs.
Depending on what problem we find, we might conclude that 3 retries is
not enough, and we might want a higher number as the default when
running with Falkon.  If the error was an application error, then no
matter how many retries we have, it won't make any difference.
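
To illustrate the retry argument, here is a small sketch.  It is not
how Swift or Falkon implement retries; run_with_retries and the two
job functions are hypothetical, and the exit codes are reused from the
tally above only as sample values.

```python
def run_with_retries(job, max_attempts):
    """Call job(attempt) until it returns exit code 0 or the attempts
    run out.  Returns the list of exit codes that were observed."""
    codes = []
    for attempt in range(1, max_attempts + 1):
        code = job(attempt)
        codes.append(code)
        if code == 0:
            break
    return codes

def application_error(attempt):
    # Hypothetical deterministic failure: same exit code every time.
    return 253

def transient_error(attempt):
    # Hypothetical transient failure: succeeds from the 4th attempt on.
    return 0 if attempt >= 4 else -3

# A deterministic application error fails at any retry limit.
app_3 = run_with_retries(application_error, 3)
app_10 = run_with_retries(application_error, 10)

# A transient error fails with 3 attempts but succeeds with 5.
transient_3 = run_with_retries(transient_error, 3)
transient_5 = run_with_retries(transient_error, 5)
```

Only the transient case benefits from a higher limit, which is why it
matters to find out from the logs which kind of failure this was
before changing the default.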

Ioan



Michael Wilde wrote:
> I'm catching up on some of this week's email.
>
> I didn't see a followup to this, nor can I tell which two jobs Ian is
> referring to or where those came from. Can anyone clarify what the
> issue is here?
>
>
> Ian Foster wrote:
>> Hi,
>>
>> I am curious whether we found out why those two jobs (?) were failing 
>> at the end of the big MolDyn run?
>>
>> Ian.
>>
>> _______________________________________________
>> Swift-devel mailing list
>> Swift-devel at ci.uchicago.edu
>> http://mail.ci.uchicago.edu/mailman/listinfo/swift-devel
>>
>>