String hash function

String hash function

Clément Béra
 
Hi all,

I am trying to investigate performance overhead in benchmarks to improve the VM performance. In one benchmark, strings are used as dictionary keys, which seems like an OK pattern to me, one that may actually be used in real code from time to time. Something struck me: the hash function of a string actually iterates over the whole string.

As a result, I get something like this:

#(
'String size: 0, Hash function performance: 17,817,776 per second' 
'String size: 1, Hash function performance: 15,395,388 per second' 
'String size: 3, Hash function performance: 14,853,109 per second' 
'String size: 4, Hash function performance: 15,151,954 per second' 
'String size: 5, Hash function performance: 14,139,881 per second' 
'String size: 10, Hash function performance: 11,545,749 per second' 
'String size: 65, Hash function performance: 3,431,067 per second' 
'String size: 130, Hash function performance: 1,879,047 per second' 
'String size: 520, Hash function performance: 511,934 per second'
)

This means that if the hash of a 1 MB String is computed, it takes a hell of a long time.
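
(A minimal harness along these lines could produce numbers in this format; this is a reconstruction, not the original benchmark code, and assumes Pharo's BlockClosure>>bench, which answers strings like '17,817,776 per second'.)

"Hypothetical reconstruction of the measurement above."
#(0 1 3 4 5 10 65 130 520) collect: [ :size |
    | s |
    s := String new: size withAll: $a.
    'String size: ', size printString,
        ', Hash function performance: ', [ s hash ] bench ]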

I wonder why we don't compute the string hash from a small number of inner characters taken randomly but deterministically from the String. For example, for all strings longer than 5 characters, one could pick only 5 characters of the String to compute its hash. That way the hash function remains decent enough, and the performance does not drop for large strings.

One implementation could be to replace:
1 to: stringSize do: [:pos |
by:
1 to: stringSize by: stringSize // 5 do: [:pos |
in:
ByteString>>stringHash: aString initialHash: speciesHash
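
(Spelled out, the changed method might look like the following sketch, based on the current implementation; the (max: 1) guard is an addition here so that the step stays positive for strings shorter than 5 characters.)

stringHash: aString initialHash: speciesHash
    "Sketch of the proposed sampling variant - not the current implementation.
     For long strings only about 5 characters are sampled."
    | stringSize hash low |
    stringSize := aString size.
    hash := speciesHash bitAnd: 16rFFFFFFF.
    1 to: stringSize by: (stringSize // 5 max: 1) do: [:pos |
        hash := hash + (aString basicAt: pos).
        "Begin hashMultiply"
        low := hash bitAnd: 16383.
        hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low)
            bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF].
    ^ hash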

Then we need to rehash the whole image.

What do you think? Is there a hash expert around who can confirm this is a good idea?

Re: String hash function

timrowledge
 

> On 13-04-2017, at 2:19 PM, Clément Bera <[hidden email]> wrote:
>
>
> This means that if the hash of a 1 MB String is computed, it takes a hell of a long time.
>
> I wonder why we don't compute the string hash from a small number of inner characters taken randomly but deterministically from the String. For example, for all strings longer than 5 characters, one could pick only 5 characters of the String to compute its hash.

Yay! We haven’t had a good hash function argument in *ages*!


tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Klingon Code Warrior:- 5) "Specs are for the weak and timid!"



Re: String hash function

Levente Uzonyi
In reply to this post by Clément Béra
 
Hi Clément,

On Thu, 13 Apr 2017, Clément Bera wrote:

> What do you think? Is there a hash expert around who can confirm this
> is a good idea?

I'm not a hash expert. But while I'm sure it's tempting to make this
change, I'm also sure it won't be a good idea. Why?

What makes a hash function good?
A good hash function is expected to give different values for similar, but
not equal inputs. If you omit information, as you proposed, you can't
satisfy this requirement.

Let's say we have strings of the form 'ProductXXXX', where XXXX is a
four-digit number identifying a given product type. At 11 characters,
these are among the shortest strings affected by your proposal.

Let's see what happens to the distinct hash values with the current
implementation (step=1) and your suggested change (step=2); the code has
some extra complexity to stay cross-dialect:

#(1 2) collect: [ :step |
    step -> ((0 to: 9999)
        collect: [ :index |
            | aString speciesHash stringSize hash low |
            aString := 'Product', ('0000' first: (4 - (index numberOfDigitsInBase: 10))), index asString.
            speciesHash := ByteString identityHash.
            stringSize := aString size.
            hash := speciesHash bitAnd: 16rFFFFFFF.
            1 to: stringSize by: step do: [ :pos |
                hash := hash + (aString basicAt: pos).
                "Begin hashMultiply"
                low := hash bitAnd: 16383.
                hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low)
                    bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF ].
            hash ]
        as: Set) size ]

The result is: {1->10000. 2->100}.

So the number of distinct hash values drops from the optimal 10000 to 100 - i.e., to 1%. What's even worse is that the problem persists if you take any small consecutive range; only the ratio changes.

So, IMO the implementation should stay as it is - the hash should include information from all bytes of the string.

The good news is that there is PluggableDictionary, which lets you implement your suggestion on the image side. Actually, if we had a really good JIT, so that the overhead of using blocks were negligible, we could just make Dictionary itself pluggable and collapse large parts of the HashedCollection hierarchy.
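
For example (a sketch using PluggableDictionary's hashBlock:/equalBlock: accessors; the sampling hash block is purely illustrative - see the collision analysis above):

| d |
d := PluggableDictionary new.
d hashBlock: [ :key | | h |
    "Sample roughly 5 characters, as proposed."
    h := key species identityHash bitAnd: 16rFFFFFFF.
    1 to: key size by: (key size // 5 max: 1) do: [ :i |
        h := (h + (key basicAt: i)) hashMultiply ].
    h ].
d equalBlock: [ :a :b | a = b ].
d at: 'someVeryLongString' put: 42.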

Levente

Re: [Pharo-dev] String hash function

Max Leske
In reply to this post by Clément Béra
 
That would radically increase the number of collisions from the hash function. I don’t think that is a good idea. *Especially* when we are talking about a hash function that will be used for dictionaries. When you test the performance of a hash function, you should also test its properties w.r.t. hashed collections (Andrés Valloud explains that nicely in “Hashing in Smalltalk”).

Cheers,
Max

On 13 Apr 2017, at 23:19, Clément Bera <[hidden email]> wrote:

> [full original message quoted; snipped]


Re: [Pharo-dev] String hash function

Nicolas Cellier
 
However, since Symbols are unique, we could have a different hashing policy - for example, pinning Symbols and then using the identity hash. It would probably complexify the handling of memory, but that's the deal.

Currently we can't, because images rely on #foo = 'foo', so we have to use the String hash. But it does not have to be so. I tried to remove this equivalence long ago (in Squeak 3.9 maybe), and it was not such a big deal, if I remember correctly...

2017-04-14 6:05 GMT+02:00 Max Leske <[hidden email]>:

> [Max's message and the quoted original snipped]



Re: [Pharo-dev] String hash function

Nicolas Cellier
 


2017-04-14 8:03 GMT+02:00 Nicolas Cellier <[hidden email]>:

> However, since Symbols are unique, we could have a different hashing policy -
> for example, pinning Symbols and then using the identity hash.
> It would probably complexify the handling of memory, but that's the deal.

Oops, forget it: the identity hash does not rely on the object's position in memory, so there is no need for pinning.
 
> [remainder of the quoted messages snipped]



Re: String hash function

Ralph Boland
In reply to this post by Clément Béra
 
I had an application where I needed to store arrays into sets, and performance depended heavily on this operation. Standard array classes were unacceptable because hashing was either too poor (then) or too expensive (now).

So I created my own array class (say HashArray; I forget the name now). HashArray stored a hash value rather than computing one. The hash value of an array was the XOR of its elements' hash values (and in my application the elements have good hash values). This worked very well but came at a price: each time an element was added, removed, or changed in a HashArray instance, the hash value had to be recomputed (an O(1) cost). For my application this wasn't a problem, so my solution worked well.
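
In outline, the scheme was something like this (a sketch; all names are made up, and #= is kept consistent with #hash by comparing the elements):

Object subclass: #HashArray
    instanceVariableNames: 'elements hash'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'HashArray-Sketch'.

HashArray class >> new: size
    ^ self basicNew setElements: (Array new: size)

HashArray >> setElements: anArray
    elements := anArray.
    hash := 0

HashArray >> elements
    ^ elements

HashArray >> at: index
    ^ elements at: index

HashArray >> at: index put: anObject
    "Maintain the stored hash incrementally: XOR the old element's hash
     value out and the new element's in - an O(1) update."
    (elements at: index) ifNotNil: [ :old | hash := hash bitXor: old hash ].
    elements at: index put: anObject.
    hash := hash bitXor: anObject hash.
    ^ anObject

HashArray >> hash
    ^ hash

HashArray >> = other
    ^ self class == other class and: [ elements = other elements ]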
Notes:
   1) In my case two arrays containing the same elements should have the same hash value. Your situation might be different; if so, you could also add to the hash value some value associated with the array itself.
   2) While contained in a set the array couldn't be changed (that didn't happen in my application), but this is true for most objects anyway, since their hash value cannot change while they are in a hashed collection; just something to be aware of.
   3) In my application the elements of the array also stored their hash values, generated with a random number generator. That could be done in my application but obviously can't always be done. It gave very good hashing for my arrays.

If you are having performance problems with array classes because of hashing, perhaps you should try a scheme similar to mine. Whether or not such a class should be part of Squeak, I don't know. Perhaps there should be a loadable package of HashArray-like classes.

Ralph Boland

Re: String hash function

KenDickey
In reply to this post by Levente Uzonyi
 
> So the number of distinct hash values drops from the optimal 10000 to 100
> - i.e., to 1%. What's even worse is that the problem persists if you take
> any small consecutive range; only the ratio changes.

The speed tradeoff seems most acute for large chunks of text.

If the original strategy were used for strings shorter than (say) 50 characters and the "sampling" strategy were used for longer strings, with the string length included in the hash, then a large chunk of text with one character added would likely not collide.

An application working with large strings could tune the hash function as suggested above if performance were poor - and that tuning would be done by someone close to the problem - while performance would remain good in the general case.
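
Something like this, for instance (a sketch; the 50-character threshold and the 5-sample step are arbitrary illustrative choices):

hybridHash: aString initialHash: speciesHash
    "Full hash below the threshold; sampled hash plus the string length
     above it, so a large text with one character added hashes
     differently, because the length changes."
    | size step hash |
    size := aString size.
    step := size < 50 ifTrue: [ 1 ] ifFalse: [ size // 5 ].
    hash := ((speciesHash bitAnd: 16rFFFFFFF) + size) hashMultiply.
    1 to: size by: step do: [ :pos |
        hash := (hash + (aString basicAt: pos)) hashMultiply ].
    ^ hash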

$0.02
-KenD

Re: [Pharo-dev] String hash function

Martin McClure-2
In reply to this post by Clément Béra
 
On 04/13/2017 02:19 PM, Clément Bera wrote:

> I wonder why we don't compute the string hash from a small number of
> inner characters taken randomly but deterministically from the String.
> For example, for all strings longer than 5 characters, one could pick
> only 5 characters in the String to compute its hash. The hash function
> is this way decent enough and the performance does not drop for large
> strings.
>
> One implementation could be to replace:
> 1 to: stringSize do: [:pos |
> by:
> 1 to: stringSize by: stringSize // 5 do: [:pos |
> in:
> ByteString>>stringHash: aString initialHash: speciesHash
>
> Then we need to rehash the whole image.
>
> What do you think ? Is there a hash expert around that can confirm this
> is a good idea ?

This kind of thing has been tried before, and failed miserably. One of
the biggest breaking incompatibilities, IIRC, between Java 1 and Java 2
was that they had to change their string hash function to hash all
characters, because their scheme of hashing only a subset of characters
produced a huge number of collisions on real-world strings like URLs,
which have a lot of characters in common.

I think it's worth the time to hash all characters.

Regards,

-Martin

Re: String hash function

Andres Valloud-4
In reply to this post by Clément Béra
 
IME, being smart about which characters to hash quickly runs into data
sets being smarter about hiding variation in the unsampled characters.

Slower hash functions aren't necessarily bad.  Practical experience:

1.  Investment bank end-of-day report takes 30 minutes to run.  The time
is spent mostly in linear search because of hash collisions.  Introduced
a 5x slower hash function that had a collision rate of about 2.5 instead
of 20+.  The report now takes 90 seconds, and hashing is nowhere to be
seen in the profiler output.  No need to optimize hashing further.

2.  VisualWorks string / symbol hashes were also "smart".  Replaced with
a multiplicative hash along the lines of the Horner rule, which is
slower for large enough strings.  But interning all symbols into a new
symbol table now takes 7x less time, and one of the k-nucleotide-related
Computer Language Shootout benchmarks now runs in half the time.
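
(A multiplicative hash in that Horner style has roughly this shape; the constants here are illustrative, not the actual VisualWorks function:)

hornerStyleHash: aString
    "Every character feeds the evolving hash, polynomial-evaluation style."
    ^ aString
        inject: 5381
        into: [ :h :c | (h * 33 + c asInteger) bitAnd: 16rFFFFFFF ]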

Regarding huge strings, surely a 1 MB string has a special meaning that
could be encapsulated within an object, and it is *this* object that
could provide a hash function suitable for the purpose at hand. In
fact, especially in those cases, the returned hash value might not even
need to be based on the 1 MB string at all.

On 4/13/17 14:19, Clément Bera wrote:

> [original message quoted; snipped]

Re: String hash function

Ben Coman
 


On Sat, Apr 15, 2017 at 3:34 AM, Andres Valloud <[hidden email]> wrote:

> [earlier points snipped]
>
> Regarding huge strings, surely a 1 MB string has a special meaning that could be encapsulated within an object, and it is *this* object that could provide a hash function suitable for the purpose at hand. In fact, especially in those cases, the returned hash value might not even need to be based on the 1 MB string at all.

This seems like the best practical advice.  

What are opinions on having a HashedString, subclassed from String, to optimise by calculating the hash only once?

  a. In the standard Image or leave it specific to applications?

  b. Obviously an ivar to store the hash computed at instantiation.

  c. Maybe have an ivar storing a custom hash block? Equality would check that this block is identical between objects.

cheers -ben




Re: String hash function

Andres Valloud-4
 
If it were up to me, I'd leave that decision to the applications.  No
need to optimize unknown future problems.  Also keep in mind
applications likely should refine strings by composition and delegation
rather than by inheritance in this particular case.  Otherwise,

String
        Base64EncodedMP3File
        PostgresqlDataBlob
        SmalltalkMethodSourceCode

... and it stops making sense quickly.
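
By composition, Ben's HashedString could be sketched like this (all names made up; the hash is computed once at instantiation, optionally via a custom hash block):

Object subclass: #HashedString
    instanceVariableNames: 'string hash'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'HashedString-Sketch'.

HashedString class >> on: aString
    ^ self on: aString hashBlock: [ :s | s hash ]

HashedString class >> on: aString hashBlock: aBlock
    ^ self basicNew setString: aString hash: (aBlock value: aString)

HashedString >> setString: aString hash: anInteger
    string := aString.
    hash := anInteger

HashedString >> string
    ^ string

HashedString >> hash
    "Cached at instantiation; never recomputed."
    ^ hash

HashedString >> = other
    ^ self class == other class and: [ string = other string ]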

On 4/14/17 16:12 , Ben Coman wrote:
> What are opinions on having a HashedString, subclassed from String, to
> optimise by calculating the hash only once?
>
>   a. In the standard Image or leave it specific to applications?
>
>   b. Obviously an ivar to store the hash computed at instantiation.
>
>   c. Maybe have an ivar storing a custom hash-block? Equality would
> check this is identical between objects.

Re: String hash function

timrowledge
 

> On 14-04-2017, at 4:38 PM, Andres Valloud <[hidden email]> wrote:
>
> If it were up to me, I'd leave that decision to the applications.  No need to optimize unknown future problems.

Exactly; remember the rules for optimising -

1) Don’t.
2) (for experts only) Don’t, *yet*.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful Latin Phrases:- O! Plus! Perge! Aio! Hui! Hem! = Oh! More! Go on! Yes! Ooh! Ummm!