[ExM Users] scaling Turbine on Vesta

Tim Armstrong tim.g.armstrong at gmail.com
Tue Apr 15 13:20:49 CDT 2014


Awesome, glad to hear.

I feel like it would be nice to have some lightweight fork mechanism to be
able to get some isolation of user processes for situations like yours,
since user code doesn't always behave nicely...

- Tim
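
For illustration only, the kind of isolation described above could be sketched in C roughly as follows. This is not an existing Turbine or ADLB feature: run_isolated, user_fn, example_ok and example_crash are hypothetical names, and the sketch assumes a POSIX platform where fork() is available, which may not hold on every compute-node kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical signature for a piece of user code. */
typedef int (*user_fn)(void *arg);

/* Run fn(arg) in a child process so that a crash or a resource leak in
 * user code cannot take down the calling worker.
 * Returns the child's exit status (0..255), or -1 if fork() failed or
 * the child terminated abnormally (for example, killed by a signal). */
int run_isolated(user_fn fn, void *arg)
{
    assert(fn != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;                 /* could not create the child */

    if (pid == 0)
        _exit(fn(arg) & 0xff);     /* child: run user code, then exit */

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;             /* unexpected wait failure */
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Hypothetical user code that behaves nicely. */
int example_ok(void *arg)
{
    (void)arg;
    return 3;
}

/* Hypothetical user code that crashes. */
int example_crash(void *arg)
{
    (void)arg;
    abort();
}
```

Anything the child leaks (open files, memory) is reclaimed when it exits, at the cost of one fork per call, so whether this counts as "lightweight" depends on how long each task runs.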


On Tue, Apr 15, 2014 at 9:22 AM, Ketan Maheshwari <ketan at mcs.anl.gov> wrote:

> Hi Tim,
>
> I think I found the issue and got past it. In my C code, I forgot to close
> a file. In the new version the file is closed after it is read, and this
> version seems to scale well. So far, on Vesta, I have been able to scale to
> 10K processes on 625 nodes without any issue.
>
> Thanks,
> Ketan
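
The fix described above, closing the file after the read, might look roughly like this in C. The original leaf code is not shown in the thread, so read_input and its parameters are hypothetical. One plausible reading of the symptom (an assumption, not confirmed in the thread) is that each call leaked one file descriptor until a per-process limit was reached.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper for a leaf function: read up to cap-1 bytes from
 * path into buf and NUL-terminate the result.
 * Returns the number of bytes read, or -1 on failure. */
long read_input(const char *path, char *buf, size_t cap)
{
    assert(path != NULL && buf != NULL && cap > 0);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    size_t n = fread(buf, 1, cap - 1, fp);
    buf[n] = '\0';
    int read_failed = ferror(fp);

    /* Close on every path once the file is open. Skipping this leaks
     * one descriptor per call, and fopen() starts failing once the
     * process runs out of descriptors. */
    if (fclose(fp) != 0 || read_failed)
        return -1;

    return (long)n;
}
```

Because a worker process calls the leaf function many times over its lifetime, a per-call leak only shows up after a number of tasks, which is why such a bug can pass small tests and fail at scale.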
>
>
> On Mon, Apr 14, 2014 at 8:45 PM, Tim Armstrong <tim.g.armstrong at gmail.com> wrote:
>
>>  It's hard to narrow it down from the info - that script seems fairly
>> unlikely to cause problems.
>>
>> What optimisation level? STC/Turbine version? How many processes?  How
>> many ADLB servers?  Is it every time you run or just intermittently?
>>
>> Can you confirm that it's not just getting stuck in the leaf function as
>> well?  E.g. log when it enters and exits.
>>
>> There is a rare race condition that can deadlock things that I'm just
>> working on now, but it seems unlikely that you would be encountering that
>> with that script.
>>
>>  - Tim
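
The entry/exit logging suggested above could be sketched in C as below. The real signature of the leaf function is not shown in the thread, so the argc/argv form, leaf_main_logged and do_work are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the real work done by the leaf function. */
static int do_work(int argc, char **argv)
{
    (void)argv;
    return argc;
}

/* Hypothetical wrapper: logs to stderr when the leaf function is
 * entered and when it returns, flushing each time so the messages
 * survive even if the process hangs or dies right afterwards. */
int leaf_main_logged(int argc, char **argv)
{
    assert(argc >= 0);

    fprintf(stderr, "leaf_main: enter argc=%d\n", argc);
    fflush(stderr);

    int rc = do_work(argc, argv);

    fprintf(stderr, "leaf_main: exit rc=%d\n", rc);
    fflush(stderr);
    return rc;
}
```

An "enter" line with no matching "exit" line points at the leaf function itself rather than at the runtime.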
>>
>>
>> On Mon, Apr 14, 2014 at 6:09 PM, Ketan Maheshwari <ketan at mcs.anl.gov> wrote:
>>
>>> Hi,
>>>
>>> I am trying to scale up a simple leaf function on Vesta. It seems that the
>>> leaf function runs at most 259 times; beyond that, it either does not
>>> return any results or crashes, and I do not see any error messages or
>>> other indications of what went wrong.
>>>
>>>  On Vesta, an example is
>>> at /home/ketan/turbine-output/2014/04/14/23/04/06
>>>
>>>  Any clue on this?
>>>
>>>  The Swift source looks as follows:
>>>
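>>> import io;
>>>
>>> // Leaf function implemented in Tcl package "leaf_main" (version "0.0"),
>>> // Tcl function "leaf_main_wrap"; dispatched to worker processes.
>>> @dispatch=WORKER
>>> (int v) leaf_main(string A[]) "leaf_main" "0.0" "leaf_main_wrap";
>>>
>>> main
>>> {
>>>   int rc[];
>>>   // 10,000 independent leaf calls, one per index
>>>   foreach i in [0:9999:1] {
>>>     rc[i] = leaf_main([fromint(i)]);
>>>   }
>>> }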
>>>
>>>
>>>  Thanks,
>>> Ketan
>>>
>>> _______________________________________________
>>> ExM-user mailing list
>>> ExM-user at lists.mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/exm-user
>>>
>>>
>>
>