One order of magnitude faster sets/dicts?!


One order of magnitude faster sets/dicts?!

Adrian Lienhard
This mail is quite long, as I copied it from a conversation with Toon Verwaest. Toon and Camillo re-implemented hashed collections for Pinocchio (http://scg.unibe.ch/research/pinocchio). Their benchmarks show that the new implementation is significantly faster than the one in Pharo 1.1. The source code is available from www.squeaksource.com/p.html (they haven't set a license, but I'm pretty sure they'll release their code as MIT).

I'm posting this here to find somebody knowledgeable in the area who can take a look at the code and figure out whether this implementation would be interesting for Pharo...

Cheers,
Adrian


citing Toon:

We ran some benchmarks on our data structures (compared against, and run in, standard Pharo 1.1). We just add/remove integers within a certain range, so the keys are always perfectly distributed, which is ideal for Pharo dictionaries.

The numbers show how long a benchmark took on average over 20 runs, together with the standard deviation over those runs. A single benchmark run generally consists of many individual operations on the collection in question. For example:

benchIncludes
   1 to: set size * 2 do: [ :i |
       set includes: i ]

where the set has size 10000. Evaluating this once counts as a single run. The number at the top of each benchmark is the sum of all the averages, with the averaged stdev.
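
(For illustration, a minimal sketch of how such a harness could compute these numbers; it assumes a #benchIncludes method like the one above and the 20-run count from the description, everything else is made up:)

    times := (1 to: 20) collect: [ :run |
        Time millisecondsToRun: [ self benchIncludes ] ].
    mean := times sum / times size.
    stdev := ((times inject: 0 into: [ :acc :t |
        acc + (t - mean) squared ]) / (times size - 1)) sqrt.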

The results:

PDictionary: 15.8 +/-1.7
   Do                0.472 +/-0.079
   RemoveKey         0.010 +/-0.021
   AtPut             0.947 +/-0.02
   AtIfAbsentPut     1.64 +/-0.96
   AtPutExisting     0.198 +/-0.011
   KeysAndValuesDo   0.502 +/-0.09
   IncludesKey       0.365 +/-0.024
   AtPutNew          4.2 +/-1.3
   Includes          7.40 +/-0.33

Dictionary: 138.2 +/-4.7
   Do                0.830 +/-0.098
   RemoveKey         10.292 +/-0.077
   AtPut             0.957 +/-0.044
   AtIfAbsentPut     1.52 +/-0.95
   AtPutExisting     0.203 +/-0.011
   KeysAndValuesDo   0.850 +/-0.096
   IncludesKey       80.25 +/-0.25
   AtPutNew          3.5 +/-1.3
   Includes          39.8 +/-4.4

PSet: 1.767 +/-0.057
   Remove            0.008 +/-0.018
   Add               0.845 +/-0.039
   AddExisting       0.310 +/-0.021
   AddNew            0.240 +/-0.021
   Includes          0.365 +/-0.024

Set: 70.9 +/-1.2
   Remove            7.737 +/-0.051
   Add               0.755 +/-0.022
   AddExisting       0.305 +/-0.015
   AddNew            2.473 +/-0.03
   Includes          59.6 +/-1.2


Obviously the very slight differences have to be taken with a grain of salt, but when something is more than 10x faster, it's quite unlikely to be due to random runtime variation that might change completely in another run.

An important property of our data structures is that they don't degrade, since we don't have colliding hashes; but we didn't really test that yet (although it becomes obvious in some of the benchmark results, such as removing elements and testing for the presence of elements).

> Is there anything special that one would need to consider when replacing the old implementation in Pharo with yours?

What doesn't happen at the moment is shrinking dictionaries after they have grown significantly. I wouldn't know at what point to do so either; nor do I think Dictionary does that?

There is a parameter indicating how many elements can be added before it switches from a SmallDictionary version to a dictionary with buckets. This is set to 20 by default, and I have no idea if it makes sense to change it; I didn't really profile what the optimal number is. Maybe it makes sense to make that a constant in the code rather than an instance variable, to save a bit of space and time ...

We have our own PHashedCollection subclassing from Collection, so the biggest part of the hierarchy is compatible with what Pharo has. Porting mostly means removing Pinocchio-specific stuff: annotations related to Pinocchio primitives and exporting directives. This is all straightforward though.

The current "ratio" of hashes to elements is set to 500%. So basically when you have 32 buckets it will only grow when it contains 160 elements (or key-value pairs), so 5 elements per bucket on average. This seems to not make it slower which is understandable since SmallDictionary is pretty fast for small amounts of elements; but this ratio might have to be tweaked depending on the kind of objects and is currently a parameter. The advantage of using 500% is that you have A LOT less garbage.

Oh, and our dictionaries do generate a lot less garbage, since Pharo's dictionaries use association objects for each key-value pair. We don't, so we never generate garbage when you remove an object.
Our Sets, on the other hand, do use a bit more memory because of the bucket scheme. Sets don't use associations, so we don't have an edge there :) Only performance-wise in that case.

The hashes are masked with the number of buckets - 1, since the number of buckets is always a power of 2.
This implies that the lower bits of the hash are the most significant ones. This does not work with the current identityHash / IdentityDictionary, since the 12-bit hash is shifted to the left by 18 (or so). This would make all the lower 18 bits 0, and thus all objects would end up in the first bucket.

That is a very important thing to note. However, I also think it's a bit strange to just bitshift the identityHash by 18. You don't win any significant bits with it ... you have 12-bit hashes; there is no way to improve that (without changing the image format). Since we overfill our dictionaries, however, this works quite OK with those bad hashes; and we'll only start wasting a bit of space after you have added 20480 elements to the dictionary (you have 4096 different hashes with 12 bits, times 5).
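
(A quick worked example of why that combination fails; the concrete numbers are assumed:)

    "With 32 buckets the mask keeps only the low 5 bits, but a 12-bit
    identityHash shifted left by 18 has all of its low 18 bits zero."
    hash := 2345 bitShift: 18.
    bucketIndex := (hash bitAnd: 32 - 1) + 1.   "=> 1, for every object"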




Re: One order of magnitude faster sets/dicts?!

Levente Uzonyi-2
On Mon, 13 Sep 2010, Adrian Lienhard wrote:

> [snip]
> Obviously the very slight differences have to be taken with a grain of salt, but when something is more than 10x faster, it's quite unlikely to be due to random runtime variation that might change completely in another run.

I'm a bit skeptical, because a 10x improvement is pretty hard to get if the
benchmark is not flawed and the code doesn't use VM support.

I wonder what the #pPrimitive:plugin: pragmas stand for in PDictionary's
#at:ifAbsent: and #at:put:.

Where can I find the benchmark code?

>
>> Is there anything special that one would need to consider when replacing the old implementation in Pharo with yours?
>
> What doesn't happen at the moment is shrinking dictionaries after they have grown significantly. I wouldn't know at what point to do so either; nor do I think Dictionary does that?

The current HashedCollections don't shrink. Remove is a rarely used
operation.

>
> Oh, and our dictionaries do generate a lot less garbage, since Pharo's dictionaries use association objects for each key-value pair. We don't, so we never generate garbage when you remove an object.
> Our Sets, on the other hand, do use a bit more memory because of the bucket scheme. Sets don't use associations, so we don't have an edge there :) Only performance-wise in that case.

One could implement the current dictionaries without associations; that's
just less object-oriented. That would generate even less "garbage".

>
> The hashes are masked with the number of buckets - 1, since the number of buckets is always a power of 2.
> This implies that the lower bits of the hash are the most significant ones. This does not work with the current identityHash / IdentityDictionary, since the 12-bit hash is shifted to the left by 18 (or so). This would make all the lower 18 bits 0, and thus all objects would end up in the first bucket.
>
> That is a very important thing to note. However, I also think it's a bit strange to just bitshift the identityHash by 18. You don't win any significant bits with it ... you have 12-bit hashes; there is no way to improve that (without changing the image format). Since we overfill our dictionaries, however, this works quite OK with those bad hashes; and we'll only start wasting a bit of space after you have added 20480 elements to the dictionary (you have 4096 different hashes with 12 bits, times 5).

Let's imagine that we want to put 10000 objects into an IdentitySet. Each
object has its identityHash in the 0..4095 range. The hash function is
very simple: hash \\ capacity + 1. Therefore the values of the hash
function will all be in 1..4096, even if the capacity is ~15000. This
causes a lot of collisions (the table will consist of a few very long
chains, degrading performance to unacceptable levels). These collisions
are avoided by the shift.
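
(A quick illustration with assumed numbers:)

    "Without the shift, 10000 keys hash into at most 4096 distinct slots
    of a ~15000-slot table; the rest of the table stays empty."
    capacity := 15013.
    slots := (1 to: 10000) collect: [ :i | i \\ 4096 \\ capacity + 1 ].
    slots asSet size.   "=> 4096"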


Levente


Re: One order of magnitude faster sets/dicts?!

Stéphane Ducasse
In reply to this post by Adrian Lienhard
Do you know if they compared with Squeak? Because Levente did a lot of work in that area.

Stef

On Sep 13, 2010, at 2:55 PM, Adrian Lienhard wrote:

> [snip]



Re: One order of magnitude faster sets/dicts?!

Adrian Lienhard
No, they just compared to Pharo 1.1.

But the benchmarks are available in a separate package.

Adrian

On Sep 13, 2010, at 15:52, Stéphane Ducasse wrote:

> Do you know if they compared with Squeak? Because Levente did a lot of work in that area.
>
> Stef
>
> On Sep 13, 2010, at 2:55 PM, Adrian Lienhard wrote:
>
>> [snip]



Re: One order of magnitude faster sets/dicts?!

Levente Uzonyi-2
In reply to this post by Levente Uzonyi-2
On Mon, 13 Sep 2010, Levente Uzonyi wrote:

> On Mon, 13 Sep 2010, Adrian Lienhard wrote:
>
>> [snip]
>
> I'm a bit skeptical, because a 10x improvement is pretty hard to get if the
> benchmark is not flawed and the code doesn't use VM support.

Okay, I found the benchmark code, and it's flawed. The distribution of the
keys is not uniform. Actually it's far from uniform: it's just 1..dictSize.
Anyway, I ran the benchmarks in Squeak 4.2 alpha and got the following
results:

PBDictionary: 0.1249 +/-0.0067
   RemoveKey         6.7e-5 +/-4.6e-5
   AtPutNew          0.0648 +/-0.0048
   Do                0.00160 +/-0.0001
   AtPutExisting     0.0033 +/-0.00018
   AtIfAbsentPut     0.0192 +/-0.0035
   AtPut             0.01557 +/-0.00028
   IncludesKey       0.0180 +/-0.0029
   KeysAndValuesDo   0.00223 +/-0.00012
   Includes          0.000133 +/-6.3e-5

PBSTDictionary: 0.2204 +/-0.0063
   RemoveKey         0.07203 +/-0.00032
   AtPutNew          0.0434 +/-0.0055
   Do                0.00 +/-0.0
   AtPutExisting     0.00260 +/-0.00013
   AtIfAbsentPut     0.01193 +/-0.0002
   AtPut             0.0147 +/-0.0031
   IncludesKey       0.01157 +/-0.00024
   KeysAndValuesDo   0.0022 +/-7.4e-5
   Includes          0.0599 +/-0.00019

The PDictionary is only better at 2 benchmarks:
Includes (not IncludesKey!) and RemoveKey, which are both rarely used. In
all other tests Squeak's Dictionary implementation is faster in this
flawed benchmark.


Levente


Re: One order of magnitude faster sets/dicts?!

Toon Verwaest-2
Hi,
>> I'm a bit skeptical, because a 10x improvement is pretty hard to get if
>> the benchmark is not flawed and the code doesn't use VM support.
> Okay, I found the benchmark code, and it's flawed. The distribution of
> the keys is not uniform. Actually it's far from uniform, it's just
> 1..dictSize.
> Anyway I ran the benchmarks in Squeak 4.2 alpha and got the following
> results
...
> The PDictionary is only better at 2 benchmarks:
> Includes (not IncludesKey!) and RemoveKey which are both rarely used.
> In all other tests Squeak's Dictionary implementation is faster in
> this flawed benchmark.
I agree this benchmark is severely flawed. But it's generally also the
ideal case for Pharo/Squeak dictionaries. Whatever might happen in other
cases to degenerate the dictionary doesn't happen here, since keys are
added in the right order (increasing hashes). So the case where room has
to be made for a key whose hash already exists in the Dictionary is
avoided; in PDictionary this scenario wouldn't even happen. Now, the
problem is that I don't have experience in writing proper benchmarks, so
this thread could provide us with some decent benchmarks that better
show standard use of dictionaries (and, as I hope, show that
PDictionaries are more resistant to degradation than the current
Dictionaries).

What we should take away from the existing benchmarks is that the
differences aren't even that high for the ideal case of Pharo/Squeak
dictionaries.

The main point in favor of the PSet / PDictionary implementations is
that you always know the exact ranges of keys/values that belong to a
certain hash. So for includesKey: and at:ifAbsent: this is ideal since
you never look further than the current hash-value. In Squeak/Pharo
dictionary/set you don't know this, thanks to
hash-collision/propagation, so you have to scan until you find an
association that is nil. If you are unlucky and your whole dictionary
has all the associations grouped together, even though they all
represent keys with different hashes, you'll have to look through a
whole bunch of totally unrelated keys.

For example, I quickly changed one of the benchmarks to be a bit less
optimal for the Squeak/Pharo collections, namely the includesKey:
benchmark (which is also representative of #at:).
Rather than doing

     1 to: dict size * 2 do: [ :i |
         dict at: (self key: i) ifAbsentPut: (self value: i)].

I changed it to:

     1 to: dict size * 10 by: 10 do: [ :i |
         dict at: (self key: i) ifAbsentPut: (self value: i)].

This forces the dictionary to miss a lot more often. Since all the keys
are grouped together, this is the worst case for Squeak/Pharo
dictionaries if you want to look something up that's not there. These are
the results on Squeak 4.1:

PBDictionary  0.0261 +/-0.0017
PBSTDictionary  5.235 +/-0.01

So around 200x faster in case of PDictionary.

While PDictionary stays stable (even slightly faster, go figure), Squeak
dictionaries degrade immensely! On Pharo the results are a bit better,
but still very much degraded:

PBDictionary 0.02387 +/-0.00014
PBSTDictionary 1.6577 +/-0.0036

So around 60x faster in case of PDictionary.


If, on Set, I ensure that all elements will be found, I get the following
result for #includes:

PBSetIncludes 0.00440 +/-0.00011
PBSTSetIncludes 0.0040 +/-9.6e-5

making PSet marginally slower. However, once you start not finding
elements, with the same change as before (to: size*10 by: 10):

PBSetIncludes 0.00443 +/-0.0001
PBSTSetIncludes 0.9844 +/-0.0017

So around 220x faster in case of PSet. While PSet and PDictionary never
really suffer from collisions, normal Squeak/Pharo dictionaries suffer
immensely and grow up to 200x slower. Btw, this last benchmark was run
on Pharo.


Anyway, you are definitely right that the benchmarks suck and that we
need better ones.

As for PDictionaries and PSets, knowing exactly what the range of
existing hashes is, I think it's only logical that they will be faster in
many of the operations that need this knowledge, such as #includesKey:,
#remove:, #at:put:/#add: and #at:/#includes:.

Since PDictionary switches between SmallDictionary style and normal
dictionary style, it also seems logical that it is faster when you are in
SmallDictionary style, without you having to care about it. This is not
shown in any benchmark either, but it is the case. However, the added
test makes it slower in the general dictionary style, so removing it
might improve things there and regain the marginal difference between
the two dictionaries in the ideal case for Pharo/Squeak.

There is anyway no reason why PDictionary would be slower than
Pharo/Squeak dictionaries except for design decisions like the
previously mentioned one, and those can be changed if wanted. There is
however a good reason why Squeak/Pharo dictionaries are slower and
degrade faster, namely colliding hashes that even propagate collisions
(also not benchmarked at the moment!).

Ok, on the whole I wasn't really planning on pushing these dictionaries
too much on Pharo / Squeak. We use them in Pinocchio and are happy with
them (especially since we have native support). But they are there and
can be used if you want to. I can understand that people are skeptical,
but since I don't really have knowledge on building benchmarks (I didn't
even set up the current ones, it was a student of mine) I don't think I
can help much there. We use them throughout our project and some parts
(such as our parser) have gotten a lot faster since we started using them.

Btw, the code would indeed just be released under the MIT license, which
is maybe already automatically the case since I signed that agreement as
a contributor to Pharo?


Some answers / comments to the other random questions and remarks:
> I wonder what the #pPrimitive:plugin: pragmas stand for in PDictionary's
> #at:ifAbsent: and #at:put:.
Those methods are implemented as primitives in the Pinocchio VM that we
are developing. In primitive form most of the methods run about twice as
fast. This is, however, not compatible with the way the stack is handled
in Squeak/Pharo, so they can't really be ported.
> The current HashedCollections don't shrink. Remove is a rarely used
> operation.
Ok, makes sense.
> One could implement the current dictionaries without associations,
> that's just less object-oriented. That would generate even less
> "garbage".
True enough.
> Let's imagine that we want to put 10000 objects into an IdentitySet.
> Each object has its identityHash in the 0..4095 range. The hash
> function is very simple: hash \\ capacity + 1. Therefore the values of
> the hash function will all be in 1..4096, even if the capacity is ~15000.
> This causes a lot of collisions (the table will consist of a few
> very long chains, degrading performance to unacceptable levels). These
> collisions are avoided by the shift.
So you change the hash implementation to ensure that the hashes take up
the whole array of the Dictionary rather than just the first 4096 +
collision positions. Makes perfect sense for the Squeak/Pharo
implementation, but rather strange to enforce this on the whole system imho.


Ok, I hope this clears up some things.

cheers,
Toon


Re: One order of magnitude faster sets/dicts?!

Toon Verwaest-2
To be fair, the text below just shows that Dictionary / Set in Pharo
degrade from O(1) to O(n), as they have to search through all the elements
in the Set/Dictionary. It's not that those high factors are constants
achieved by a better implementation. Pinocchio dictionaries just stay O(1).
The only way to degrade Pinocchio dictionaries is by giving them bad
hashes; but that's something you can't solve in all cases anyway.

cheers,
Toon

> This forces the dictionary to miss a lot more often. [snip] These
> are the results on Squeak 4.1:
>
> PBDictionary  0.0261 +/-0.0017
> PBSTDictionary  5.235 +/-0.01
>
> So around 200x faster in case of PDictionary.
>
> [snip]
>
> PBSetIncludes 0.00443 +/-0.0001
> PBSTSetIncludes 0.9844 +/-0.0017
>
> So around 220x faster in case of PSet. While PSet and PDictionary
> never really suffer from collisions, normal Squeak/Pharo dictionaries
> suffer immensely and grow up to 200x slower. Btw, this last benchmark
> was run on Pharo.



Re: One order of magnitude faster sets/dicts?!

Stéphane Ducasse
In reply to this post by Levente Uzonyi-2
tx levente

Stef


Re: One order of magnitude faster sets/dicts?!

Levente Uzonyi-2
In reply to this post by Toon Verwaest-2
On Tue, 14 Sep 2010, Toon Verwaest wrote:

> Hi,
>>> I'm a bit skeptical, because a 10x improvement is pretty hard to get if the
>>> benchmark is not flawed and the code doesn't use VM support.
>> Okay, I found the benchmark code, and it's flawed. The distribution of the
>> keys is not uniform. Actually it's far from uniform, it's just 1..dictSize.
>> Anyway I ran the benchmarks in Squeak 4.2 alpha and got the following
>> results
> ...
>> The PDictionary is only better at 2 benchmarks:
>> Includes (not IncludesKey!) and RemoveKey which are both rarely used. In
>> all other tests Squeak's Dictionary implementation is faster in this flawed
>> benchmark.
> I agree this benchmark is severely flawed. But it's generally also the ideal
> case for Pharo/Squeak dictionaries. Whatever might happen in other cases to

In one sense, it's ideal: #at:put: is fast, because the first checked slot
will be an empty slot. But #removeKey:ifAbsent: and #includes: will be
very slow, because those have to check all subsequent slots. Normally
a hash table which uses open addressing with linear probing has short
chains. The average chain length is below a constant which depends only on
the load factor; this is how it provides O(1) time for the operations.
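
(For reference, a lookup with open addressing and linear probing looks roughly like this; a sketch with assumed names, not Squeak's actual code:)

    scanFor: anObject
        "Answer the index of anObject, or of the empty slot where it would
        go. Scan linearly from the hash slot, wrapping around the array."
        | index start |
        index := start := anObject hash \\ array size + 1.
        [ (array at: index)
            ifNil: [ ^ index ]
            ifNotNil: [ :slot | slot = anObject ifTrue: [ ^ index ] ].
          (index := index \\ array size + 1) = start ] whileFalse.
        ^ 0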

In Squeak the default hash function is very simple and therefore fast. But
it has a caveat: if you provide low-quality input, like integers from a
small range (1..10000), it will have bad performance. With
PluggableSet/PluggableDictionary you can change the hash function if you
need to. There is a class method which creates a new collection that's
more tolerant of the small-integer-range, non-uniform-distribution issue:
PluggableSet integerSet.
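
(Usage, for the record; only PluggableSet integerSet itself is from the class library, the numbers are made up:)

    set := PluggableSet integerSet.
    1 to: 10000 do: [ :i | set add: i ].
    set includes: 9999.   "fast even for this key distribution"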

> degenerate the dictionary doesn't happen here since keys are added in the
> right order (increasing hashes). So whenever place should be made for a key

So in this case the dictionary is far from normal, since it has only a few
chains which are very long.

> with an already existing hash in Dictionary, this is avoided. In PDictionary
> however this scenario wouldn't even happen. Now the problem is that I don't
> have experience in writing proper benchmarks and this thread could provide us
> with some decent benchmarks that better show standard use of dictionaries
> (and as I hope, show that PDictionaries are more resistant to degradation
> than the current Dictionaries.

Here is the benchmark I used while I was changing Squeak's Dictionary
implementation:
http://leves.web.elte.hu/collections/DictionaryBenchmark.st
I'm sure it could be improved and simplified a bit with the new
collection methods.

>
> What we should take away from the existing benchmarks is that the differences
> aren't even that high for the ideal case of Pharo/Squeak dictionaries.
>
> The main point in favor of the PSet / PDictionary implementations is that you
> always know the exact ranges of keys/values that belong to a certain hash. So
> for includesKey: and at:ifAbsent: this is ideal since you never look further
> than the current hash-value. In Squeak/Pharo dictionary/set you don't know
> this, thanks to hash-collision/propagation, so you have to scan until you
> find an association that is nil. If you are unlucky and your whole dictionary
> has all the associations grouped together, even though they all represent
> keys with different hashes, you'll have to look through a whole bunch of
> totally unrelated keys.

It's the classical separate chaining vs open addressing problem. IMHO open
addressing is better in general: it uses less space, has fewer cache misses
and has better performance with a proper hash function. Though there are
special cases where it's inferior to separate chaining.
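
For contrast, a minimal sketch of a separate-chaining lookup in the spirit of
what PDictionary is described to do (a hypothetical structure, not its actual
code):

includes: anObject
   "Each slot holds either nil or a bucket of elements sharing a hash
   value, so a lookup never scans past elements with other hashes."
   | bucket |
   bucket := buckets at: anObject hash \\ buckets size + 1.
   ^bucket notNil and: [ bucket includes: anObject ]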

>
> For example, I quickly changed one of the benchmarks to be a bit less
> optimal for the Squeak/Pharo collections, namely the includesKey: benchmark
> (which is also representative for at:).
> Rather than doing
>
>    1 to: dict size * 2 do: [ :i|
>        dict at: (self key: i) ifAbsentPut: (self value: i)].
>
> I changed it to:
>
>    1 to: dict size * 10 by: 10 do: [ :i|
>        dict at: (self key: i) ifAbsentPut: (self value: i)].
>
> This forces the dictionary to miss a lot more often. Since all the keys are
> grouped together, this is the worst case for Squeak/Pharo dictionaries if you
> want to look something up that's not there. These are the results on Squeak
> 4.1:
>
> PBDictionary  0.0261 +/-0.0017
> PBSTDictionary  5.235 +/-0.01
>
> So around 200x faster in the case of PDictionary.

The values of the hash function are still in 1..100000 and the
distribution of the values is very far from uniform in that small range.

>
> While PDictionary stays stable (even slightly faster, go figure), Squeak
> dictionaries degrade immensely! On Pharo the results are a bit better, but
> still very much degraded:
>
> PBDictionary 0.02387 +/-0.00014
> PBSTDictionary 1.6577 +/-0.0036
>
> So around 60x faster in the case of PDictionary.
>
>
> If on Set I ensure that all elements will be found, I get the following
> result for #includes:
>
> PBSetIncludes 0.00440 +/-0.00011
> PBSTSetIncludes 0.004000 +/-9.6e-5
>
> making PSet marginally slower. However, once you start not finding elements,
> with the same change as before (to: size*10 by: 10):
>
> PBSetIncludes 0.00443 +/-0.0001
> PBSTSetIncludes 0.9844 +/-0.0017
>
> So around 220x faster in the case of PSet. While PSet and PDictionary never
> really suffer from collisions, normal Squeak/Pharo dictionaries suffer
> immensely and become up to 200x slower. Btw, this last benchmark was run on
> Pharo.

See above.

>
>
> Anyway, you are definitely right that the benchmarks suck and that we need
> better ones.
>
> As for PDictionaries and PSets, knowing exactly what the range of existing
> hashes is, I think it's only logical that they will be faster in many of the
> operations that need this knowledge, such as #includesKey:,
> #remove:, #at:put:/#add: and #at:/#includes:.

See above.

>
> Since PDictionary switches between SmallDictionary style and normal
> dictionary style, it also seems logical that it is faster when it happens to
> be in SmallDictionary style, without the user having to care about it. This
> is also not shown in any benchmark, but it is the case and makes it faster
> there. However, the added mode test makes it slower in the general
> dictionary style, so removing it might improve things in that case and
> regain the marginal difference between both dictionaries in the ideal case
> for Pharo/Squeak.
>
> Anyway, there is no reason why PDictionary would be slower than Pharo/Squeak
> dictionaries except for design decisions like the previously mentioned one,
> and those can be changed if wanted. There is, however, a good reason why
> Squeak/Pharo dictionaries are slower and degrade faster, namely colliding
> hashes that even propagate collisions (also not benchmarked at the moment!).

Set has better cache locality than PSet; I think Dictionary is similar to
PDictionary in this regard.

>
> Ok, on the whole I wasn't really planning on pushing these dictionaries too
> much on Pharo / Squeak. We use them in Pinocchio and are happy with them
> (especially since we have native support). But they are there and can be used
> if you want to. I can understand that people are skeptical, but since I don't
> really have knowledge of building benchmarks (I didn't even set up the
> current ones; that was a student of mine) I don't think I can help much there.
> We use them throughout our project, and some parts (such as our parser) have
> gotten a lot faster since we started using them.

I'm not saying that a hashed collection can't be faster than the current
implementation, but a few people (including me) have implemented several
variants over the years, and the current design was found to be the best so
far for general purposes.
Performance can easily be boosted by primitives, especially in special
cases, like this one: http://leves.web.elte.hu/LargeIdentityDictionary/ ,
where I used the existing #pointsTo: primitive to boost #includesKey:
(http://leves.web.elte.hu/LargeIdentityDictionary/LargeIdentityDictionary2.png).
With a better primitive (for this purpose) all operations could be boosted
drastically, but I prefer code written in Smalltalk. I think Cog will be a
lot faster in the near future, so the performance gap will become smaller.
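
To illustrate the idea behind the #pointsTo: trick (a hypothetical sketch
assuming elements are stored directly in the internal array; the actual
LargeIdentityDictionary code is at the link above and differs):

includes: anObject
   "#pointsTo: is a VM primitive that scans the receiver's slots for a
   reference to the argument, so an identity-based membership test can
   be answered in one primitive call instead of a Smalltalk probe loop."
   ^array pointsTo: anObject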

>
> Btw, the code would indeed just be released under the MIT license, which is
> maybe already automatically the case since I signed that agreement as a
> contributor to Pharo?

If you didn't upload your code to a Pharo repository, then it's not
automatically MIT just because you signed the agreement.

>
>
> Some answers / comments to the other random questions and remarks:
>> I wonder what the #pPrimitive:plugin: pragmas stand for in PDictionary's
>> #at:ifAbsent: and #at:put:.
> Those methods are implemented as primitives in the Pinocchio VM that we are
> developing. In primitive form, most of the methods run about twice as fast.
> This is, however, not compatible with the way the stack is handled in
> Squeak/Pharo, so it can't really be ported.
>> The current HashedCollections don't shrink. Remove is a rarely used
>> operation.
> Ok, makes sense.
>> One could implement the current dictionaries without associations, that's
>> just less object-oriented. That would generate even less "garbage".
> True enough.
>> Let's imagine that we want to put 10000 objects into an IdentitySet. Each
>> object has its identityHash in the 0..4095 range. The hash function is
>> very simple: hash \\ capacity + 1. Therefore the values of the hash
>> function will all be in 1..4096, even if capacity is ~15000. This causes a
>> lot of collisions (the table will consist of a few very long chains,
>> degrading performance to unacceptable levels). These collisions are avoided
>> by the shift.
> So you change the hash implementation to ensure that the values take up the
> whole array of the Dictionary rather than just the first 4096 positions
> (plus collisions). Makes perfect sense for the Squeak/Pharo implementation,
> but rather strange to enforce this on the whole system imho.

Well, it's only that way in Pharo. In Squeak we did it differently:
#identityHash returns the value of the primitive (0..4095) and
#scaledIdentityHash does the shifting (0..SmallInteger maxVal). So "old"
code that uses #identityHash doesn't break and doesn't have performance
issues.
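
A sketch of what such a method might look like (the 18-bit shift here is an
assumption, chosen so that 4095 shifted left still fits in a SmallInteger):

scaledIdentityHash
   "Spread the 12-bit primitive identityHash (0..4095) over a much
   larger range, so tables with capacity above 4096 fill evenly."
   ^self identityHash bitShift: 18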

>
>
> Ok, I hope this clears up some things.

Yes, thanks.


Levente

>
> cheers,
> Toon
>