Breaking the 4GB barrier with Pharo 6 64-bit


Re: Breaking the 4GB barrier with Pharo 6 64-bit

Thierry Goubier
On 14/11/2016 at 21:51, stepharo wrote:
> Hi thierry
>
>
> did you happen to have a techreport or any description of your work?

It's a chapter in my PhD thesis ;)

Thierry

>
> Stef
>
>
> On 11/11/16 at 11:44, Thierry Goubier wrote:
>> On 11/11/2016 at 11:29, Stephan Eggermont wrote:
>>> On 10/11/16 21:35, Igor Stasenko wrote:
>>>> No, no, no! This is simply not true.
>>>> It is you who writes the code that generates all that statistical and
>>>> analysis data, and its output is fairly predictable... otherwise you are
>>>> not collecting data at all, just random noise, aren't you?
>>>
>>> That would be green field development. In brown field development, I
>>> only get in when people start noticing there is a problem (why do we
>>> need more than 4GBytes for this?). At that point I want to be able to
>>> load everything they can give me in an image so I can start analyzing
>>> and structuring it.
>>>
>>>> I mean, Doru is light years ahead of me and many others in the field
>>>> of data analysis... so what can I advise him about his playground?
>>>
>>> Well, the current FAMIX model implementation is clearly not well
>>> structured for analyzing large code bases. And it is difficult to
>>> partition because of unpredictable access patterns and high
>>> interconnection.
>>
>> This is why you look for a general-purpose, efficient off-loading
>> scheme: optimize for the general case and get reasonable performance
>> out of it (a.k.a. Fuel, but designed for partial unloading/loading:
>> allow dangling references within a unit of load, and focus on per-page
>> units to match the underlying storage layer or network).
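The per-page off-loading scheme Thierry describes could be sketched roughly like this (a minimal illustration with invented names, not Fuel's actual format): objects are laid out in fixed-size pages, cross-page references are stored symbolically and may dangle until the target page is faulted in.

```python
from dataclasses import dataclass

PAGE_SIZE = 4  # objects per page; tiny so the example crosses page boundaries

@dataclass
class Node:
    value: int
    next_ref: object = None  # (page_index, slot) pair, or None at end of chain

class PagedStore:
    """Objects laid out in fixed-size pages; cross-page references may
    dangle until the target page is faulted in."""
    def __init__(self):
        self.pages = []      # list of pages, each a list of Node
        self.loaded = set()  # page indices currently "in memory"

    def store(self, values):
        # Lay out a chain of values page by page, linking via (page, slot).
        nodes = [Node(v) for v in values]
        for i, n in enumerate(nodes[:-1]):
            n.next_ref = ((i + 1) // PAGE_SIZE, (i + 1) % PAGE_SIZE)
        for start in range(0, len(nodes), PAGE_SIZE):
            self.pages.append(nodes[start:start + PAGE_SIZE])

    def fetch(self, page, slot):
        # Resolve a (possibly dangling) reference, faulting the page in.
        if page not in self.loaded:
            self.loaded.add(page)  # stands in for a disk or network read
        return self.pages[page][slot]

store = PagedStore()
store.store(list(range(10)))
node, count = store.fetch(0, 0), 1
while node.next_ref is not None:
    node = store.fetch(*node.next_ref)
    count += 1
print(count, sorted(store.loaded))  # 10 [0, 1, 2]
```

The point of the design is that a unit of load never has to pull in the whole graph: following a reference into an unloaded page is what triggers the load.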
>>
>> I wrote one such layer for VW a long time ago, but didn't have time to
>> experiment with / qualify some of the techniques in it. There was an
>> interesting (IMHO... never qualified) attempt at combining paging with
>> automatic refinement of the application's working set, based on earlier
>> experience implementing a hierarchical 2D object access scheme for
>> large datasets on a slow medium (it decreased access time from 30
>> minutes to a few seconds).
>>
>> The other approach I would look at is to take some of the support code
>> for such an automatic layer and use it to unload parts of my model; and
>> I'm pretty sure that, if I don't benchmark intensively, I'll get the
>> partitioning wrong :(
>>
>> Overall, an interesting subject, though hardly defensible from a
>> scientific point of view (the database guys have already solved
>> everything). Only viable as a hobby, or if a company is ready to pay
>> for a solution.
>>
>> Thierry
>>
>>
>
>
>



Re: Breaking the 4GB barrier with Pharo 6 64-bit

Eliot Miranda-2
In reply to this post by Denis Kudriashov


On Thu, Nov 10, 2016 at 1:31 AM, Denis Kudriashov <[hidden email]> wrote:

2016-11-10 9:49 GMT+01:00 [hidden email] <[hidden email]>:
Ah, but then it may be more interesting to have a data image (maybe a lot of these) and a front end image.

Isn't Seamless something that could help us here? No need to bring the data back, just manipulate it through proxies.

The problem is that the server image will still perform GC. And GC will be slow if the server image is big, which will stop the whole world.

Which is why we plan on implementing an incremental global GC that will not stop the world, but will divide global GC up into many small steps, each of which will be shorter than 10 milliseconds, and so not be noticeable. 
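The idea behind such an incremental collector can be simulated in a few lines (this is only an illustration of time-budgeted marking in Python, not the actual Spur/Cog implementation): global marking is sliced into increments, each bounded by a pause budget, with the mutator running between increments.

```python
import time

def incremental_mark(roots, edges, budget_s=0.010):
    """Mark all reachable objects, but never spend more than budget_s
    in a single increment; the mutator would run between increments."""
    marked, stack, pauses = set(), list(roots), []
    while stack:
        start = time.perf_counter()
        while stack and time.perf_counter() - start < budget_s:
            obj = stack.pop()
            if obj not in marked:
                marked.add(obj)
                stack.extend(edges.get(obj, ()))
        pauses.append(time.perf_counter() - start)
    return marked, pauses

# A long chain of references: 0 -> 1 -> ... -> 10_000.
edges = {i: [i + 1] for i in range(10_000)}
marked, pauses = incremental_mark([0], edges)
print(len(marked))  # 10001: everything reachable is eventually marked
```

The real collector additionally needs write barriers so that mutator activity between increments cannot hide live objects from the marker; the sketch shows only the pause-bounding part.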

_,,,^..^,,,_
best, Eliot

Re: Breaking the 4GB barrier with Pharo 6 64-bit

Eliot Miranda-2
In reply to this post by philippeback
Hi Phil,

On Thu, Nov 10, 2016 at 2:19 AM, [hidden email] <[hidden email]> wrote:

[...]

The problem is that the server image will still perform GC. And GC will be slow if the server image is big, which will stop the whole world.

What if we asked it to not do any GC at all? Like if we have tons of RAM, why bother? Especially if what it is used to is to keep datasets: load them, save image to disk. When needed trash the loaded stuff and reload from zero.

Basically that is what happens with Spark.


While global GC may not be useful for big-data scavenging, it probably will be for any non-trivial query.  But I think I see a misconception here.  The large RAM on a multicore machine would be divided up between the cores.  It makes no sense to run a single Smalltalk across lots of cores (we're a long way from having a thread-safe class library).  It makes much more sense to have one Smalltalk per core.  That brings the heap sizes down and makes GC less scary.
 
and Tachyon/Alluxio is kind of solving this kind of issue (it might be nice to have that interacting with a Pharo image): http://www.alluxio.org/ It basically keeps data in memory so it can be reused between workload runs.

Sure.  We have all the facilities we need to do this.  We can add and remove code at runtime so we can keep live instances running, and send the code to them along with the data we want them to crunch.
 

Or have an object memory for work and one for datasets (first one gets GC'd, the other one isn't).

Or have policies one can switch.  There are quite a few levers into the GC from the image, and one can easily switch off global GC with the right levers.  One doesn't need a VM without a GC; one needs an image that uses the right policy.
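As an analogy from another managed runtime (CPython, not Pharo): Python exposes exactly such levers through its `gc` module, so the "load datasets, never globally collect, trash and reload" policy can be chosen without an interpreter built GC-less.

```python
import gc

gc.disable()              # lever 1: no automatic cyclic collection from now on
junk = [[i] for i in range(100_000)]  # allocate freely; refcounting still reclaims
del junk
unreachable = gc.collect()  # lever 2: one explicit, manually triggered full pass
gc.enable()                 # lever 3: back to the default policy
print(gc.isenabled())       # True
```

The image-side equivalent would be picking which levers to pull and when, i.e. policy, while the collector itself stays in the VM.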

Phil

_,,,^..^,,,_
best, Eliot

Re: Breaking the 4GB barrier with Pharo 6 64-bit

Igor Stasenko


On 15 November 2016 at 02:18, Eliot Miranda <[hidden email]> wrote:

[...]

It makes no sense to run a single Smalltalk across lots of cores (we're a long way from having a thread-safe class library). It makes much more sense to have one Smalltalk per core. That brings the heap sizes down and makes GC less scary.

Yep, that's the approach we tried in HydraVM.
 
 
[...]

Or have policies one can switch. There are quite a few levers into the GC from the image, and one can easily switch off global GC with the right levers.

Or just mark whole data (sub)graphs with some bit telling the GC to skip over them, so it won't attempt to scan them and treats them as always alive.
This is where we get back to my idea of heap spaces, where you can toss a subgraph into a special heap space whose policy is that it is never scanned/GCed automatically, and collection can be triggered only manually, or something like that.
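For what it's worth, CPython 3.7+ ships a close analogue of such a "never scanned" space, which gives a feel for the policy Igor describes (this is Python's collector, not Pharo's): `gc.freeze()` moves every currently live object into a permanent generation that later collections simply skip.

```python
import gc

dataset = [bytes(1000) for _ in range(10_000)]  # long-lived "data" subgraph
gc.collect()
gc.freeze()                      # move everything alive into the permanent space
frozen = gc.get_freeze_count()   # objects the collector will now skip entirely
# Working-set objects allocated from here on are still collected as usual.
gc.unfreeze()                    # opt the data back in, e.g. before shutdown
print(frozen > 0, gc.get_freeze_count())  # True 0
```

The catch, in any such scheme, is that references from the frozen space out to collectable objects must still be treated as roots, or the GC will reclaim things the frozen data points at.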
 




--
Best regards,
Igor Stasenko.

Re: Breaking the 4GB barrier with Pharo 6 64-bit

philippeback


On Tue, Nov 22, 2016 at 5:57 PM, Igor Stasenko <[hidden email]> wrote:

[...]

This is where we get back to my idea of heap spaces, where you can toss a subgraph into a special heap space whose policy is that it is never scanned/GCed automatically, and collection can be triggered only manually.

Could be very useful for all kinds of large binary data, like videos and sounds that we can load once and keep in the heap space.

How hard would it be to get something like that?

Phil
 
 


Re: Breaking the 4GB barrier with Pharo 6 64-bit

Sven Van Caekenberghe-2

> On 22 Nov 2016, at 19:16, [hidden email] wrote:
>
> [...]
>
> Basically that is what happens with Spark.
>
> http://sujee.net/2015/01/22/understanding-spark-caching/#.WCRIgy0rKpo
> https://0x0fff.com/spark-misconceptions/
>
> [...]
>
> Could be very useful for all kinds of large binary data, like videos and sounds that we can load once and keep in the heap space.
>
> How hard would it be to get something like that?

Large binary data poses no problem (as long as it's not a copying GC). Since a binary blob contains no subpointers, no tracing work needs to be done: a 1 MB and a 1 GB ByteArray are the same amount of GC work.
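Sven's point is observable in any tracing collector that exposes its edges; for instance in CPython (an analogy, not Pharo): a binary blob contributes zero edges for the marker to follow, whatever its size, while a pointer-bearing container contributes one edge per slot.

```python
import gc

blob = bytearray(10**6)   # 1 MB of raw bytes; a 1 GB one traces identically
boxed = [0] * 1000        # tiny by comparison, but every slot is a reference

blob_edges = len(gc.get_referents(blob))    # edges the marker must follow
boxed_edges = len(gc.get_referents(boxed))
print(blob_edges, boxed_edges)  # 0 1000
```

So marking cost is driven by the number of pointer fields, not the number of bytes.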


Re: Breaking the 4GB barrier with Pharo 6 64-bit

Eliot Miranda-2


On Tue, Nov 22, 2016 at 10:26 AM, Sven Van Caekenberghe <[hidden email]> wrote:

[...]

Large binary data poses no problem (as long as it's not a copying GC). Since a binary blob contains no subpointers, no tracing work needs to be done: a 1 MB and a 1 GB ByteArray are the same amount of GC work.

+1

_,,,^..^,,,_
best, Eliot

Re: Breaking the 4GB barrier with Pharo 6 64-bit

philippeback


On Wed, Nov 23, 2016 at 12:53 AM, Eliot Miranda <[hidden email]> wrote:

[...]

> Large binary data poses no problem (as long as it's not a copying GC). Since a binary blob contains no subpointers, no tracing work needs to be done: a 1 MB and a 1 GB ByteArray are the same amount of GC work.

+1

Amen to that. But a dataset made of a gazillion composites is not the same, right?

Phil 


Re: Breaking the 4GB barrier with Pharo 6 64-bit

Igor Stasenko


On 23 November 2016 at 10:50, [hidden email] <[hidden email]> wrote:

[...]

Amen to that. But a dataset made of a gazillion composites is not the same, right?

Yep, as soon as you have references in your data, you add more work for the GC.
 

--
Best regards,
Igor Stasenko.

Re: Breaking the 4GB barrier with Pharo 6 64-bit

philippeback


On Wed, Nov 23, 2016 at 10:51 AM, Igor Stasenko <[hidden email]> wrote:

[...]

Yep, as soon as you have references in your data, you add more work for the GC.

That's what I thought. I have seen Craig Latta marking some objects with special flags in the object headers. Could there be some generic mechanism for that, now that we have 64-bit images with larger object headers? Like setting/resetting a kind of bitmask to let some spaces be GC'd and others left alone? Things that we could manage image-side?
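A toy simulation of the skip-bit idea Phil asks about (invented names and bit layout, purely illustrative, not Spur's actual header format): a per-object header word carries a flag the marker checks before tracing, so flagged objects are treated as opaque and always alive.

```python
GC_SKIP = 1 << 0   # "treat as opaque and always alive": never trace inside
MARKED  = 1 << 1

class Obj:
    def __init__(self, refs=(), header=0):
        self.header, self.refs = header, list(refs)

def mark(roots):
    # Trace from the roots, honouring the skip bit in each header.
    stack = list(roots)
    while stack:
        o = stack.pop()
        if o.header & (GC_SKIP | MARKED):
            continue           # already done, or flagged "leave me alone"
        o.header |= MARKED
        stack.extend(o.refs)

video = Obj(header=GC_SKIP)    # a pinned media blob: the marker skips it
leaf = Obj()
root = Obj(refs=[video, leaf])
mark([root])
print(bool(root.header & MARKED), bool(leaf.header & MARKED),
      bool(video.header & MARKED))  # True True False
```

Managing such a bitmask image-side would then amount to flipping the flag on the roots of the subgraphs you want the collector to leave alone.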

(damn, I need more money in the bank to let me work on these things for a long stretch, it is so frustrating </end of rant>).

Phil

 

Re: Breaking the 4GB barrier with Pharo 6 64-bit

Igor Stasenko


On 23 November 2016 at 12:41, [hidden email] <[hidden email]> wrote:

[...]

> Could there be some generic mechanism there, now that we have 64-bit images with larger object headers? Like setting/resetting a kind of bitmask to let some spaces be GC'd and others left alone?
> this is where we getting back to my idea of heap spaces, where you can toss a subgraph into a special heap space that has such policy, that it is never scanned/GCed automatically and can be triggered only manually or something like that.
>
> Could be very useful for all kinds of large binary data, like videos and sounds that we can load once and keep in the heap space.
>
> How hard would it be to get something like that?

Large binary data poses no problem (as long as it's not a copying GC). Since a binary blob contains no subpointers, no work needs to be done. A 1M or 1G ByteArray is the same amount of GC work.

+1

Amen to that. But a dataset made of a gazillion composites is not the same, right?

yep, as soon as you have references in your data, you add more work for GC

That's what I thought. I have seen Craig Latta marking some objects with special flags in the object headers. Could there be some generic mechanism there now that we have 64-bit, super large headers? Like setting/resetting a kind of bitmask to let some spaces be GC'd or left alone? Things that we could manage image side?

Well, adding bit(s) is just the simplest part of the story. The main one is implementing a GC discipline that does not walk over marked object(s), together with a mechanism to ensure that the marked object(s) form a closed subgraph (i.e. there are no references coming in from outside of it).
Scanning+marking a graph is usually a simple matter; you just need to provide the root(s). I experimented with this in HydraVM, with a process we called mytosis - but it had a slightly different purpose:
- I implemented two primitives: one that scans a graph and reports whether it is fully isolated,
and another that basically clones the graph into a separate memory region to start it as an image in its own thread etc.
But in our scenario, I imagine, you cannot fully avoid external references - the most obvious one is instance->class references. In that case, we need some kind of mechanism to ensure that the class objects
referenced by object(s) in the desired data set are kept in the system as long as our blob is unchanged. That could be solved by simply declaring a 'fixed' set of external references per subgraph, which live as normal object(s) in the system, with the only exception, as I mentioned, that we need to ensure they won't be GCed - or, even better, won't be moved - as long as our isolated graph is in use.
Then the only thing left is to set the whole graph into read-only mode and you're ready to go..
And then, as you can imagine, having such a mechanism opens even more interesting opportunities, like offloading the graph to disk and/or (re)loading it on demand etc. Which is closely related to my flame-topic in this thread :)
But the point is that identifying and designating subgraph(s) cannot be automated - this will always be the responsibility of the user(s), because only the user knows best what should be treated as static data and what should not, etc.
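An image-side sketch of the first of those two primitives might look like this (purely illustrative; `isPointers`, `instVarAt:` and `basicAt:` are standard reflective methods, but a real implementation would be a VM primitive, as described):

```smalltalk
"Walk the graph from a root, answering its closure and the set of classes
 it depends on - the 'fixed' external references discussed in this thread."
| seen todo obj classes |
seen := IdentitySet new.
todo := OrderedCollection with: (OrderedCollection with: 1 with: 'two').
[ todo isEmpty ] whileFalse: [
    obj := todo removeLast.
    (seen includes: obj) ifFalse: [
        seen add: obj.
        obj class isPointers ifTrue: [
            1 to: obj class instSize do: [ :i | todo add: (obj instVarAt: i) ].
            1 to: obj basicSize do: [ :i | todo add: (obj basicAt: i) ] ] ] ].
classes := (seen collect: [ :each | each class ]) asSet.
"classes must stay alive (and ideally unmoved) while the subgraph is
 frozen; every other reference in seen is internal to the closure."
```

A graph is then "isolated" in the other direction too only if nothing outside `seen` points into it, which is the part that genuinely needs VM support.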
 




--
Best regards,
Igor Stasenko.

Re: Breaking the 4GB barrier with Pharo 6 64-bit

philippe.back@highoctane.be

Le 23 nov. 2016 12:07, "Igor Stasenko" <[hidden email]> a écrit :
>> That's what I thought. I have seen Craig Latta marking some objects with special flags in the object headers. Could there be some generic mechanism there now that we have 64-bit, super large headers? Like setting/resetting a kind of bitmask to let some spaces be GC'd or left alone? Things that we could manage image side?
>>
> Well, adding bit(s) is just the simplest part of the story. The main one is implementing a GC discipline that does not walk over marked object(s), together with a mechanism to ensure that the marked object(s) form a closed subgraph (i.e. there are no references coming in from outside of it).
> Scanning+marking a graph is usually a simple matter; you just need to provide the root(s). I experimented with this in HydraVM, with a process we called mytosis - but it had a slightly different purpose:
> - I implemented two primitives: one that scans a graph and reports whether it is fully isolated,
> and another that basically clones the graph into a separate memory region to start it as an image in its own thread etc.
> But in our scenario, I imagine, you cannot fully avoid external references - the most obvious one is instance->class references. In that case, we need some kind of mechanism to ensure that the class objects
> referenced by object(s) in the desired data set are kept in the system as long as our blob is unchanged. That could be solved by simply declaring a 'fixed' set of external references per subgraph, which live as normal object(s) in the system, with the only exception, as I mentioned, that we need to ensure they won't be GCed - or, even better, won't be moved - as long as our isolated graph is in use.
> Then the only thing left is to set the whole graph into read-only mode and you're ready to go..
> And then, as you can imagine, having such a mechanism opens even more interesting opportunities, like offloading the graph to disk and/or (re)loading it on demand etc. Which is closely related to my flame-topic in this thread :)
> But the point is that identifying and designating subgraph(s) cannot be automated - this will always be the responsibility of the user(s), because only the user knows best what should be treated as static data and what should not, etc.

Yes, I understand the implications and the root object thing.

I also read about Mariano's work on Marea which could do the disk piece.

Maybe a package manifest can help for specifying what should stay put.
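For instance, such a manifest could be as simple as a class-side method naming the root objects to leave alone. This is a purely hypothetical sketch: `DataSetManifest`, `RadiometryCache` and `TileIndex` are invented names, and nothing in Pharo consumes such a declaration today.

```smalltalk
Object subclass: #DataSetManifest
    instanceVariableNames: ''
    classVariableNames: ''
    package: 'MyProject-Manifests'.

DataSetManifest class >> pinnedRoots
    "Answer the roots of subgraphs the GC should treat as static data.
     RadiometryCache and TileIndex stand in for real model classes."
    ^ { RadiometryCache default. TileIndex default }
```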

Is there any way we could get a grant or something for such a project?

It is really important to have such features to avoid massive GC pauses.

My use case is to load the data sets from here. http://proba-v.vgt.vito.be/sites/default/files/Product_User_Manual.pdf

Phil



Re: Breaking the 4GB barrier with Pharo 6 64-bit

Thierry Goubier
Hi Phil,

2016-11-23 12:17 GMT+01:00 [hidden email] <[hidden email]>:

[ ...]

It is really important to have such features to avoid massive GC pauses.

My use case is to load the data sets from here. http://proba-v.vgt.vito.be/sites/default/files/Product_User_Manual.pdf

I've used that type of data before, a long time ago.

I consider that tiled / on-demand block loading is the way to go for those. Work with the header as long as possible, stream tiles if you need to work on the full data set. There is a good chance that:

1- You're memory bound for anything you compute with them
2- I/O time dominates, or becomes low enough not to care (very fast SSDs)
3- It's very rare that you need full random access on the complete array
4- GC doesn't matter

Stream computing is your solution! This is how the raster GIS are implemented.
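In Pharo terms, streaming a band tile-by-tile could look like this (the file name and tile size are assumptions for the example; `asFileReference` and `binaryReadStream` are the standard FileReference API):

```smalltalk
"Fold a computation over a large raster file one tile at a time, so the
 image never holds more than one tile and the GC has little to trace."
| stream tileSize sum tile |
stream := 'PROBAV_band1.raw' asFileReference binaryReadStream.
tileSize := 256 * 256.   "bytes per tile for an 8-bit band"
sum := 0.
[ [ stream atEnd ] whileFalse: [
    tile := stream next: tileSize.   "one fresh ByteArray per iteration"
    sum := sum + (tile inject: 0 into: [ :acc :px | acc + px ]) ] ]
        ensure: [ stream close ].
```

Each tile is a pointer-free ByteArray, so per the earlier point in this thread it costs the GC almost nothing.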

What is hard for me is manipulating a very large graph, or a very large sparse structure, like a huge Famix model or an FPGA layout model with a full design laid out on top. There, you're randomly accessing the whole structure (or at least you see no obvious partition) and the structure is too large for the memory or the GC.

This is why, a long time ago, I had this idea of an in-memory working set / on-disk full structure with automatic determination of what the working set is.
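A minimal sketch of that working-set idea, using Fuel for the on-disk side. `WorkingSet` and its selectors are invented for illustration; `FLSerializer serialize:toFileNamed:` and `FLMaterializer materializeFromFileNamed:` are Fuel's documented convenience methods, though the exact API may differ per version.

```smalltalk
Object subclass: #WorkingSet
    instanceVariableNames: 'capacity entries lru'
    classVariableNames: ''
    package: 'Sketch'.

WorkingSet >> initialize
    capacity := 1000.            "entries kept in memory"
    entries := Dictionary new.
    lru := OrderedCollection new.

WorkingSet >> at: key
    "Serve from memory, or page the entry back in from its Fuel file."
    | value |
    value := entries
        at: key
        ifAbsent: [ FLMaterializer materializeFromFileNamed: key asString , '.fuel' ].
    entries at: key put: value.
    lru remove: key ifAbsent: [ nil ].
    lru addFirst: key.
    entries size > capacity ifTrue: [ self evictOldest ].
    ^ value

WorkingSet >> evictOldest
    "Offload the least recently used entry. A smarter policy could watch
     access frequency to determine the working set automatically."
    | key |
    key := lru removeLast.
    FLSerializer serialize: (entries at: key) toFileNamed: key asString , '.fuel'.
    entries removeKey: key
```

The hard part, as noted above, is not the cache mechanics but evicting partial subgraphs while tolerating dangling references.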

For pointers, have a look at the Graph500 and HPCG benchmarks, especially the efficiency (ratio to peak) of HPCG runs, to see how difficult these cases are.

Regards,

Thierry

Re: Breaking the 4GB barrier with Pharo 6 64-bit

philippeback
Thanks Thierry.

Please also see that with new satellites, the resolution is ever increasing (e.g. Sentinel http://m.esa.int/Our_Activities/Observing_the_Earth/Copernicus/Overview4)

I understand the tile thing and indeed a lot of the algos work on tiles, but there are other ways to do this and especially with real time geo queries on custom defined polygons, you go only so far with tiles. A reason why we are using GeoTrellis backed by Accumulo in order to pump data very fast in random order.

We are adding 30+ servers to the cluster at the moment just to deal with the sizes as there is a project mapping energy landscape https://vito.be/en/land-use/land-use/energy-landscapes. This thing is throwing YARN containers and uses CPU like, intensively. It is not uncommon for me to see their workload eating everything for a serious amount of CPU seconds.

It would be silly not to plug Pharo into all of this infrastructure I think. 

Especially given the PhD/Postdoc/brainiacs per square meter there. If you have seen the Lost TV show, well, it kind of feels like that, working at that place. Especially given that it is kind of hidden in the woods.

Maybe you could have interesting interactions with them. These guys also have their own nuclear reactor and geothermal drilling.

Phil





Re: Breaking the 4GB barrier with Pharo 6 64-bit

Thierry Goubier


2016-11-23 15:46 GMT+01:00 [hidden email] <[hidden email]>:
Thanks Thierry.

Please also see that with new satellites, the resolution is ever increasing (e.g. Sentinel http://m.esa.int/Our_Activities/Observing_the_Earth/Copernicus/Overview4)

It has always been so. Anytime you reach a reasonable size, they send a new satellite with higher res / larger images :)
 

I understand the tile thing and indeed a lot of the algos work on tiles, but there are other ways to do this and especially with real time geo queries on custom defined polygons, you go only so far with tiles. A reason why we are using GeoTrellis backed by Accumulo in order to pump data very fast in random order.

But that means you're dealing with preprocessed / graph-georeferenced data (aka openstreetmap type of data). If you're dealing with raster, your polygons are approximated by a set of tiles (with a nice tile size well suited to your network / disk array).

I had reasonable success a long time ago (1991, I think), for Ifremer, with an unbalanced, sort-of-quadtree-based decomposition for highly irregular curves on the seabed. Tree node size / tile size was computed to be exactly equal to the disk block size on a very slow medium. That sort of work is along the lines of a geographic index for a database: optimise query accesses to geo-referenced objects... What is hard, and probably what you are doing, is combining geographic queries with graph queries (give me all houses in Belgium within a ten-minute bus + walk trip to a primary school)(*)

(*) One can work that out on a raster for speed. This is what GRASS does for example.

(**) I asked a student to accelerate some raster processing on a very small FPGA a long time ago. Once he had understood that he could pipeline the design to increase the frequency, he then discovered that the FPGA would happily grok data faster than the computer bus could provide it :) leaving no bandwidth for the data to be written back to memory.
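The node-size-to-block-size matching mentioned above is simple arithmetic. A sketch with assumed numbers (block, header and entry sizes are all illustrative):

```smalltalk
"Fit a quadtree node exactly into one disk block."
| blockSize header entrySize capacity |
blockSize := 4096.    "bytes per block on the storage medium"
header := 32.         "node bookkeeping: bounds and four child offsets"
entrySize := 20.      "one curve point: x, y coordinates plus an id"
capacity := (blockSize - header) // entrySize.
"capacity = 203: a node splits into four children only when it would
 exceed that many points, so every node read is exactly one block read."
```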
 

We are adding 30+ servers to the cluster at the moment just to deal with the sizes as there is a project mapping energy landscape https://vito.be/en/land-use/land-use/energy-landscapes. This thing is throwing YARN containers and uses CPU like, intensively. It is not uncommon for me to see their workload eating everything for a serious amount of CPU seconds.

Only a few seconds ?
 

It would be silly not to plug Pharo into all of this infrastructure I think. 

I've had quite bad results with Pharo on compute-intensive code recently, so I'd plan carefully how I use it. On that sort of hardware, in the projects I'm working on, 1000x faster than Pharo on a single node is about the expected target.
 

Especially given the PhD/Postdoc/brainiacs per square meter there. If you have seen the Lost TV show, well, it kind of feels working there at that place. Especially given that is is kind of hidden in the woods.

Maybe you could have interesting interactions with them. These guys also have their own nuclear reactor and geothermal drilling.

I'd be interested, because we're working a bit on high performance parallel runtimes and compilation for those. Perhaps one day, when you're ready to talk about it, you could visit our place? South of Paris, not too hard to reach by public transport :)

Thierry






Re: Breaking the 4GB barrier with Pharo 6 64-bit

philippeback


On Wed, Nov 23, 2016 at 4:16 PM, Thierry Goubier <[hidden email]> wrote:


2016-11-23 15:46 GMT+01:00 [hidden email] <[hidden email]>:
Thanks Thierry.

Please also see that with new satellites, the resolution is ever increasing (e.g. Sentinel http://m.esa.int/Our_Activities/Observing_the_Earth/Copernicus/Overview4)

It has always been so. Anytime you reach a reasonable size, they send a new satellite with higher res / larger images :)
 

I understand the tile thing and indeed a lot of the algos work on tiles, but there are other ways to do this and especially with real time geo queries on custom defined polygons, you go only so far with tiles. A reason why we are using GeoTrellis backed by Accumulo in order to pump data very fast in random order.

But that means you're dealing with preprocessed / graph-georeferenced data (aka openstreetmap type of data). If you're dealing with raster, your polygons are approximated by a set of tiles (with a nice tile size well suited to your network / disk array).

I had reasonable success a long time ago (1991, I think), for Ifremer, with an unbalanced, sort of quadtree based decomposition for highly irregular curves on the seabed. Tree node size / tile size was computed to be exactly equal to the disk block size on a very slow medium. That sort of work is in the line of a geographic index for a database: optimise query accesses to geo-referenced objects... what is hard, and probably what you are doing, is combining geographic queries with graph queries (give me all houses in Belgium within a ten minutes bus + walk trip to a primary school)(*)

(*) One can work that out on a raster for speed. This is what GRASS does for example.

(**) I asked a student to accelerate some raster processing on a very small FPGA a long time ago. Once he had understood that he could pipeline the design to increase the frequency, he then discovered that the FPGA would happily grok data faster than the computer bus could provide it :) leaving no bandwidth for the data to be written back to memory.

Yes, but the network can be pretty fast with bonded Ethernet interfaces these days. 
 

We are adding 30+ servers to the cluster at the moment just to deal with the sizes as there is a project mapping energy landscape https://vito.be/en/land-use/land-use/energy-landscapes. This thing is throwing YARN containers and uses CPU like, intensively. It is not uncommon for me to see their workload eating everything for a serious amount of CPU seconds.

Only a few seconds ?

CPU-seconds, that's the cluster usage unit for CPU. http://serverfault.com/questions/138703/a-definition-for-a-cpu-second
So, say a couple million of them on a 640-core setup. CPU power is the limiting factor in these workloads, it seems.  
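For scale, the arithmetic behind that unit (the two-million figure is the ballpark quoted above):

```smalltalk
"A couple million CPU-seconds on a 640-core cluster."
| cpuSeconds cores wallClock |
cpuSeconds := 2 * 1000 * 1000.
cores := 640.
wallClock := cpuSeconds / cores.
"wallClock = 3125 seconds, i.e. roughly 52 minutes if the load spreads
 perfectly across all cores - CPU, not I/O, is what you run out of."
```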
 

It would be silly not to plug Pharo into all of this infrastructure I think. 

I've had quite bad results with Pharo on compute intensive code recently, so I'd plan carefully how I use it. On that sort of hardware, in the projects I'm working on, 1000x faster than Pharo on a single node is about an expected target.

Sure, but the lower-level C/C++ things are run from Python or Java, so Pharo will not do worse. The good bit about Pharo is that one can ship a preloaded image, and that is easier than sending gigabyte (!) sized uberjars around that Java will unzip before running; the same is true of Python's myriad dependencies. An image file appears super small then. 
 

Especially given the PhD/Postdoc/brainiacs per square meter there. If you have seen the Lost TV show, well, it kind of feels working there at that place. Especially given that is is kind of hidden in the woods.

Maybe you could have interesting interactions with them. These guys also have their own nuclear reactor and geothermal drilling.

I'd be interested, because we're working a bit on high performance parallel runtimes and compilation for those. If one day you happen to be ready to talk about it in our place? South of Paris, not too hard to reach by public transport :)

Sure, that would be awesome. But Q1Y17 then, because my schedule is pretty packed at the moment. I can show you the thing over the web from my side, so you can see where we are in terms of systems. I guess you are much more advanced, but one of the goals of the project here is to be pretty approachable and to gather a community that will cross-pollinate algos and datasets for network effects.

Phil
 




Re: Breaking the 4GB barrier with Pharo 6 64-bit

Thierry Goubier
Le 23/11/2016 à 20:11, [hidden email] a écrit :

>
>
> On Wed, Nov 23, 2016 at 4:16 PM, Thierry Goubier
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>
>
>     2016-11-23 15:46 GMT+01:00 [hidden email]
>     <mailto:[hidden email]> <[hidden email]
>     <mailto:[hidden email]>>:
>
>         Thanks Thierry.
>
>         Please also see that with new satellites, the resolution is ever
>         increasing (e.g.
>         Sentinel http://m.esa.int/Our_Activities/Observing_the_Earth/Copernicus/Overview4
>         <http://m.esa.int/Our_Activities/Observing_the_Earth/Copernicus/Overview4>)
>
>
>     It has always been so. Anytime you reach a reasonable size, they
>     send a new satellite with higher res / larger images :)
>
>
>
>         I understand the tile thing and indeed a lot of the algos work
>         on tiles, but there are other ways to do this and especially
>         with real time geo queries on custom defined polygons, you go
>         only so far with tiles. A reason why we are using GeoTrellis
>         backed by Accumulo in order to pump data very fast in random order.
>
>
>     But that means you're dealing with preprocessed / graph georeferenced
>     data (aka openstreetmap type of data). If you're dealing with
>     raster, your polygons are approximated by a set of tiles (with a
>     nice tile size well suited to your network / disk array).
>
>     I had reasonable success a long time ago (1991, I think), for
>     Ifremer, with an unbalanced, sort of quadtree based decomposition
>     for highly irregular curves on the seabed. Tree node size / tile
>     size was computed to be exactly equal to the disk block size on a
>     very slow medium. That sort of work is in the line of a geographic
>     index for a database: optimise query accesses to geo-referenced
>     objects... what is hard, and probably what you are doing, is
>     combining geographic queries with graph queries (give me all houses
>     in Belgium within a ten minutes bus + walk trip to a primary school)(*)
>
>     (*) One can work that out on a raster for speed. This is what GRASS
>     does for example.
>
>     (**) I asked a student to accelerate some raster processing on a
>     very small FPGA a long time ago. Once he had understood he could
>     pipeline the design to increase the frequency, he then discovered
>     that the FPGA would happily grok data faster than the computer bus
>     could provide it :) leaving no bandwidth for the data to be written
>     back to memory.
>
>
> Yes, but network can be pretty fast with bonded Ethernet interfaces
> these days.

You mean they are not using HPC interconnects?

>         We are adding 30+ servers to the cluster at the moment just to
>         deal with the sizes as there is a project mapping energy
>         landscape https://vito.be/en/land-use/land-use/energy-landscapes
>         <https://vito.be/en/land-use/land-use/energy-landscapes>. This
>         thing is throwing YARN containers and uses CPU like,
>         intensively. It is not uncommon for me to see their workload
>         eating everything for a serious amount of CPU seconds.
>
>
>     Only a few seconds?
>
>
> CPU-seconds, that's the cluster usage unit for CPU.
> http://serverfault.com/questions/138703/a-definition-for-a-cpu-second
> So, say a couple million of them on a 640-core setup. CPU power is the
> limiting factor in these workloads, it seems.

If I understand well, the cluster has enough memory to load in RAM all
the data, then.
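
As a back-of-the-envelope check of the workload size quoted above (assuming round numbers: two million CPU-seconds, the 640 cores mentioned, and perfect parallel scaling with no scheduling overhead):

```python
# Sanity check: wall-clock time for a CPU-bound workload measured in
# CPU-seconds, under ideal scaling across all cores. The figures are
# assumptions taken from the "couple millions" / "640 core" quotes above.
cpu_seconds = 2_000_000
cores = 640

wall_clock_seconds = cpu_seconds / cores
print(f"{wall_clock_seconds:.0f} s, about {wall_clock_seconds / 3600:.1f} h")
```

So a couple million CPU-seconds is on the order of an hour of wall-clock time when the whole cluster is dedicated to it; in practice, contention and imperfect scaling stretch that out.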

>         It would be silly not to plug Pharo into all of this
>         infrastructure I think.
>
>
>     I've had quite bad results with Pharo on compute intensive code
>     recently, so I'd plan carefully how I use it. On that sort of
>     hardware, in the projects I'm working on, 1000x faster than Pharo on
>     a single node is about an expected target.
>
>
> Sure, but the lower-level C/C++ things are run from Python or Java, so
> Pharo will not do worse. The good bit about Pharo is that one can ship
> a preloaded image, which is easier than sending gigabyte-sized (!)
> uberjars around that Java will unzip before running; the same is true
> of Python's myriad dependencies. An image file appears super small by
> comparison.

Agreed. Pharo 64-bit is interesting there because it installs a lot
better than the 32-bit version. And, as far as I could see, it is at
least as stable as the 32-bit version for my needs.

>         Especially given the PhD/postdoc/brainiacs per square meter
>         there. If you have seen the Lost TV show, well, working at that
>         place kind of feels like that. Especially given that it is kind
>         of hidden in the woods.
>
>         Maybe you could have interesting interactions with them. These
>         guys also have their own nuclear reactor and geothermal drilling.
>
>
>     I'd be interested, because we're working a bit on high performance
>     parallel runtimes and compilation for those. If one day you happen
>     to be ready to talk about it in our place? South of Paris, not too
>     hard to reach by public transport :)
>
> Sure, that would be awesome. But Q1Y17 then, because my schedule is
> pretty packed at the moment. I can show you the thing over the web from
> my side, so you can see where we are in terms of systems. I guess you
> are much more advanced, but one of the goals of the project here is to
> be pretty approachable and gather a community that will cross-pollinate
> algos and datasets for network effects.

Ok. We can arrange that; I'm also quite busy until the end of the year
;) The goal here is also to make such high-performance systems more
usable, but, on average, the targeted system is a bit more HPC-oriented
(dedicated interconnects, nodes with GPUs or Xeon Phi). We also have
some interesting work going on with microservers (highly packed,
high-efficiency servers with lower-power CPUs, ARM, FPGAs).

Thierry
