Hi Mike,

It seems to be resolved now. There were multiple issues:

In my config file, use.provider.staging was set to true, and in the sites file the staging method was set to "file". This conflicted with CDM link creation because a file with the link's name was already present. It was resolved by setting use.provider.staging to false and removing the staging-method line from sites.xml.
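
For reference, the relevant settings now look roughly like this (I am quoting the key names from memory, so please check them against the userguide before relying on them):

  cf file:
    use.provider.staging=false

  sites.xml (the line I removed):
    <profile namespace="swift" key="stagingMethod">file</profile>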

It turns out that Mars only works when the licence file is present in the same directory as the data; for some reason it does not accept a symlinked licence file. So the licence file had to be excluded from CDM. I use individual patterns to CDM the inputs.
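
The CDM file now looks something like the sketch below. The patterns and paths are placeholders rather than my exact ones, and I am assuming DIRECT is the policy behind the link creation I mentioned; the licence file matches no DIRECT rule, so it falls through to DEFAULT and gets staged as a normal copy:

  rule .*\.inp DIRECT /home/ketan/mars/data
  rule .*\.par DIRECT /home/ketan/mars/data
  rule .* DEFAULT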

In one of the configurations, where I set all my output file mappings to absolute paths in the Swift source script as well as in mappers.sh, I was getting falsely successful jobs: Swift did not complain, but only blank output files were touch'd (by CDM?). It complained only at the end, when the files could not be found by the last job, which takes them as input.
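
For illustration, a stripped-down sketch of a mapper emitting relative paths (what I switched to). The file names here are hypothetical, and the exact output format the ext mapper expects should be double-checked against the userguide:

  #!/bin/bash
  # emit one mapping line per output file, with paths relative to the run directory
  for i in $(seq 0 99); do
      echo "[$i] outdir/result_$i.out"
  done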

Another issue was with the workdirectory in my sites.xml. Mine was a relative path, whereas yours was absolute. Swift failed with exit status 127 in my case and worked once I provided an absolute path. I am not sure whether this was on trunk or on 0.93.1; I will check again.
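
Concretely, in sites.xml (the path is just an example):

  <workdirectory>swiftwork</workdirectory>              <- failed with exit status 127
  <workdirectory>/home/ketan/swiftwork</workdirectory>  <- works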

Regarding the earlier issue I raised, where Swift did not start the number of parallel jobs on the local provider corresponding to the jobThrottle value: I observe that this is indeed the case for the local provider, but it does not seem to happen when using coasters *locally*. So I tried both approaches on a 32-core machine and found that the coaster provider performed better than the local provider *with* CDM (although only the inputs were CDM'd, 7M per job). Here are the results for different throttle values (intended to use different numbers of CPUs) with coasters:

8 cores  -- 13m 25s
16 cores -- 12m 40s
24 cores -- 10m 51s
32 cores -- 10m 57s

With the local provider, some inputs CDM'd:

8 cores  -- 15m 8s
16 cores -- 12m 4s
24 cores -- 12m 37s
32 cores -- 11m 39s

It looks like the coaster provider does not take the data-movement-to-jobs ratio into account, and in this case that turns out to be faster.
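
For the record, these are the throttle settings behind those core counts, keeping initialScore constant as you suggested and using the (jobThrottle * 100) + 1 rule (0.23 and 0.31 are my extrapolation for 24 and 32 cores):

  <profile namespace="karajan" key="initialScore">10000</profile>
  <profile namespace="karajan" key="jobThrottle">0.07</profile>

  jobThrottle 0.07 -> 8 jobs, 0.15 -> 16, 0.23 -> 24, 0.31 -> 32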

I also observe that the local provider starts with far fewer jobs than intended and slowly ramps up, almost always reaching the intended peak only after about 25% of the jobs have completed.

Regards,
Ketan

On Tue, Oct 23, 2012 at 7:14 PM, Michael Wilde <wilde@mcs.anl.gov> wrote:

I just noticed your mention here of a "too many open files" problem.

Can you tell me what "ulimit -n" (max # of open files) reports for your system?

Can you alter your app script to return the 100+ files in a tarball instead of individually?

What may be happening here is:

- if you have a low -n limit (e.g. 1024) and

- you are using provider staging, meaning the swift or coaster service jvm will be writing the final output files directly and

- you are writing 32 jobs x 100 files concurrently then

-> you will exceed your limit of open files.

Just a hypothesis - you'll need to dig deeper and see if you can extend the ulimit for -n.

- Mike

----- Original Message -----
> From: "Ketan Maheshwari" <ketancmaheshwari@gmail.com>
> To: "Michael Wilde" <wilde@mcs.anl.gov>
> Cc: "Swift Devel" <swift-devel@ci.uchicago.edu>
> Sent: Tuesday, October 23, 2012 2:02:15 PM
> Subject: Re: [Swift-devel] jobthrottle value does not correspond to number of parallel jobs on local provider
> Mike,
>
> Thank you for your answers.
>
> I tried catsnsleep with n=100 and s=10 and indeed the number of
> parallel jobs corresponded to the jobthrottle value.
> Surprisingly, when I started the mars application immediately after
> this, it also started 32 jobs in parallel. However, the run failed
> with "too many open files" error after a while.
>
> Now, I am trying cdm method. Will keep you posted.
>
> On Tue, Oct 23, 2012 at 2:36 PM, Michael Wilde <wilde@mcs.anl.gov>
> wrote:
>
> Ketan, looking further I see that your app has a large number of
> output files, O(100). Depending on their size, and the speed of the
> filesystem on which you are testing, that reinforces my suspicion
> that the low concurrency you are seeing is due to staging IO.
>
> If this is a local 32-core host, try running with your input and
> output data and workdirectory all on a local hard disk (or even
> /dev/shm if it has sufficient RAM/space). Then try using CDM direct as
> explained at:
>
> http://www.ci.uchicago.edu/swift/guides/trunk/userguide/userguide.html#_specific_use_cases
>
> - Mike
>
> ----- Original Message -----
>
> > From: "Michael Wilde" <wilde@mcs.anl.gov>
> > To: "Ketan Maheshwari" <ketancmaheshwari@gmail.com>
> > Cc: "Swift Devel" <swift-devel@ci.uchicago.edu>
> > Sent: Tuesday, October 23, 2012 12:23:34 PM
> > Subject: Re: [Swift-devel] jobthrottle value does not correspond to
> > number of parallel jobs on local provider
> > Hi Ketan,
> >
> > In the log you attached I see this:
> >
> > <profile key="jobThrottle" namespace="karajan">0.10</profile>
> > <profile namespace="karajan" key="initialScore">100000</profile>
> >
> > You should leave initialScore constant, and set to a large number, no
> > matter what level of manual throttling you want to specify via
> > sites.xml. We always use 10000 for this value. Don't attempt to vary
> > the initialScore value for manual throttle: just use jobThrottle to
> > set what you want.
> >
> > A jobThrottle value of 0.10 should run 11 jobs in parallel
> > (jobThrottle * 100) + 1 (for historical reasons related to the
> > automatic throttling algorithm).
> >
> > If you are seeing less than that, one common cause is that the ratio
> > of your input staging times to your job run times is so high as to
> > make it impossible for Swift to keep the expected/desired number of
> > jobs in active state at once.
> >
> > I suggest you test the throttle behavior with a simple app script like
> > "catsnsleep" (catsn with an artificial sleep to increase job
> > duration). If your settings (sites + cf) work for that test, then they
> > should work for the real app, within the staging constraints. Using
> > CDM "direct" mode is likely what you want here to eliminate
> > unnecessary staging on a local cluster.
> >
> > In your test, what was this ratio? Can you also post your cf file and
> > the progress log from stdout/stderr?
> >
> > - Mike
> >
> > ----- Original Message -----
> > > From: "Ketan Maheshwari" <ketancmaheshwari@gmail.com>
> > > To: "Swift Devel" <swift-devel@ci.uchicago.edu>
> > > Sent: Tuesday, October 23, 2012 10:34:25 AM
> > > Subject: [Swift-devel] jobthrottle value does not correspond to
> > > number of parallel jobs on local provider
> > > Hi,
> > >
> > > I am trying to run an experiment on a 32-core machine with the hope
> > > of running 8, 16, 24 and 32 jobs in parallel. I am trying to control
> > > these numbers of parallel jobs by setting the Karajan jobthrottle
> > > values in sites.xml to 0.07, 0.15, and so on.
> > >
> > > However, it seems that the values are not corresponding to what I
> > > see in the Swift progress text.
> > >
> > > Initially, when I set jobthrottle to 0.07, only 2 jobs started in
> > > parallel. Then I added the line setting "Initialscore" value to
> > > 10000, which improved the jobs to 5. After this a 10-fold increase
> > > in "initialscore" did not improve the jobs count.
> > >
> > > Furthermore, a new batch of 5 jobs get started only when *all* jobs
> > > from the old batch are over as opposed to a continuous supply of
> > > jobs from "site selection" to "stage out" state which happens in
> > > the case of coaster and other providers.
> > >
> > > The behavior is same in Swift 0.93.1 and latest trunk.
> > >
> > > Thank you for any clues on how to set the expected number of
> > > parallel jobs to these values.
> > >
> > > Please find attached one such log of this run.
> > > Thanks, --
> > > Ketan
> > >
> > > _______________________________________________
> > > Swift-devel mailing list
> > > Swift-devel@ci.uchicago.edu
> > > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
> >
> > --
> > Michael Wilde
> > Computation Institute, University of Chicago
> > Mathematics and Computer Science Division
> > Argonne National Laboratory
> >
> > _______________________________________________
> > Swift-devel mailing list
> > Swift-devel@ci.uchicago.edu
> > https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel
>
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
>
> --
> Ketan

--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory

--
Ketan