String hash function

String hash function

Clément Béra
Hi all,

I am trying to investigate performance overhead in benchmarks to improve the VM performance. In one benchmark, strings are used as dictionary keys, which seems like a reasonable pattern to me that may actually show up in real code from time to time. Something struck me: the hash function of a string actually iterates over the whole string.

As a result, I get something like this:

#(
'String size: 0, Hash function performance: 17,817,776 per second' 
'String size: 1, Hash function performance: 15,395,388 per second' 
'String size: 3, Hash function performance: 14,853,109 per second' 
'String size: 4, Hash function performance: 15,151,954 per second' 
'String size: 5, Hash function performance: 14,139,881 per second' 
'String size: 10, Hash function performance: 11,545,749 per second' 
'String size: 65, Hash function performance: 3,431,067 per second' 
'String size: 130, Hash function performance: 1,879,047 per second' 
'String size: 520, Hash function performance: 511,934 per second'
)

This means that computing the hash of a 1 MB String takes a hell of a long time.

I wonder why we don't compute the string hash from a small number of inner characters taken randomly but deterministically from the String. For example, for all strings longer than 5 characters, one could pick only 5 characters in the String to compute its hash. That way the hash function is still decent enough, and its performance does not drop for large strings.

One implementation could be to replace:

    1 to: stringSize do: [:pos |

with:

    1 to: stringSize by: stringSize // 5 do: [:pos |

in ByteString>>stringHash: aString initialHash: speciesHash.

Then we need to rehash the whole image.
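
A minimal sketch of what the sampled method might look like, assuming the current shape of stringHash:initialHash: and reusing hashMultiply for the mixing step; the (max: 1) guard keeps the step positive for strings shorter than 5 characters:

    stringHash: aString initialHash: speciesHash
        "Hypothetical sampling variant, for illustration only: visit
         roughly 5 characters, stepping through aString at stringSize // 5."
        | stringSize hash |
        stringSize := aString size.
        hash := speciesHash bitAnd: 16rFFFFFFF.
        1 to: stringSize by: (stringSize // 5 max: 1) do: [:pos |
            hash := (hash + (aString basicAt: pos)) hashMultiply].
        ^ hash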

What do you think? Is there a hash expert around who can confirm this is a good idea?

Re: String hash function

Max Leske
That would radically increase the number of collisions from the hash function. I don't think that is a good idea, *especially* when we are talking about a hash function that will be used for dictionaries. When you test the performance of a hash function you should also test its properties w.r.t. hashed collections (Andrés Valloud explains that nicely in "Hashing in Smalltalk").
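
For instance, with a step of stringSize // 5, a 10-character key is sampled at positions 1, 3, 5, 7 and 9 only, so any two keys that differ solely in the unread positions collide. A hypothetical snippet, assuming the class-side stringHash:initialHash: sketched above:

    | a b |
    a := 'aXaXaXaXaX'.
    b := 'aYaYaYaYaY'.
    "Both keys look like 'aaaaa' to the sampled hash, so this answers true:"
    (ByteString stringHash: a initialHash: 0) = (ByteString stringHash: b initialHash: 0)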

Cheers,
Max

On 13 Apr 2017, at 23:19, Clément Bera <[hidden email]> wrote:

[...]


Re: String hash function

Martin McClure
On 04/13/2017 02:19 PM, Clément Bera wrote:

> I wonder why we don't compute the string hash from a small number of
> inner characters taken randomly but deterministically from the String.
> [...]
> What do you think? Is there a hash expert around who can confirm this
> is a good idea?

This kind of thing has been tried before, and failed miserably. One of
the biggest breaking incompatibilities between Java 1 and Java 2, IIRC,
was that they had to change their string hash function to hash all
characters, because their scheme of hashing only a subset of characters
had a huge number of collisions on real-world strings like URLs, which
have a lot of characters in common.

I think it's worth the time to hash all characters.

Regards,

-Martin


Re: [Vm-dev] String hash function

Andres Valloud
IME, being smart about which characters to hash quickly runs into data
sets being smarter about hiding variation in the unsampled characters.

Slower hash functions aren't necessarily bad.  Practical experience:

1.  Investment bank end-of-day report takes 30 minutes to run.  The time
is spent mostly in linear search because of hash collisions.  Introduced
a 5x slower hash function that had a collision rate of about 2.5 instead
of 20+.  The report now takes 90 seconds, and hashing is nowhere to be
seen in the profiler output.  No need to optimize hashing further.

2.  VisualWorks string and symbol hashes were also "smart".  They were
replaced with a multiplicative hash along the lines of Horner's rule,
which is slower for large enough strings.  But interning all symbols
into a new symbol table now takes 7x less time, and one of the
k-nucleotide Computer Language Shootout benchmarks now runs in half
the time.
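
For reference, a minimal sketch of a Horner-rule multiplicative hash (the selector, the multiplier 31 and the 30-bit mask are illustrative choices, not VisualWorks' actual ones):

    hornerHashOf: aString
        "Fold every character into the running hash:
         hash = ((c1 * m + c2) * m + c3) * m + ..."
        | hash |
        hash := 0.
        aString do: [:each |
            hash := (hash * 31 + each asInteger) bitAnd: 16r3FFFFFFF].
        ^ hash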

Regarding huge strings, surely a 1 MB string has a special meaning that
could be encapsulated within an object, and it is *this* object that
could provide a hash function suitable for the purpose at hand.  In
fact, especially in those cases, the returned hash value might not even
need to be based on the 1 MB string at all.
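
As a sketch of that idea (the class and its variables are hypothetical): the big string travels inside a domain object whose identity, and therefore its hash, is a small key:

    Object subclass: #LargeDocument
        instanceVariableNames: 'id contents'
        classVariableNames: ''
        category: 'Example'

    LargeDocument >> id
        ^ id

    LargeDocument >> hash
        "Hash the small id, never the 1 MB contents."
        ^ id hash

    LargeDocument >> = other
        ^ self class = other class and: [self id = other id]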

On 4/13/17 14:19, Clément Bera wrote:

>
>
>
>
> Hi all,
>
> I am trying to investigate performance overhead in benchmarks to improve
> the VM performance. In a benchmark strings are used as dictionary keys,
> which seems to be an OK pattern to me that may be actually used in real
> code from time to time. Something stroke me: _the hash function of a
> string is actually iterating over the whole string. _
>
> As a result, I get something like this:
>
> #(
> 'String size: 0, Hash function performance: 17,817,776 per second'
> 'String size: 1, Hash function performance: 15,395,388 per second'
> 'String size: 3, Hash function performance: 14,853,109 per second'
> 'String size: 4, Hash function performance: 15,151,954 per second'
> 'String size: 5, Hash function performance: 14,139,881 per second'
> 'String size: 10, Hash function performance: 11,545,749 per second'
> 'String size: 65, Hash function performance: 3,431,067 per second'
> 'String size: 130, Hash function performance: 1,879,047 per second'
> 'String size: 520, Hash function performance: 511,934 per second'
> )
>
> This means that if a hash of a 1MB String is computed it takes a hell of
> a time.
>
> I wonder why we don't compute the string hash from a small number of
> inner characters taken randomly but deterministically from the String.
> From example, for all strings longer than 5 characters, one could pick
> only 5 characters in the String to compute its hash. The hash function
> is this way decent enough and the performance does not drop for large
> strings.
>
> One implementation could be to replace:
> 1 to: stringSize do: [:pos |
> by:
> 1 to: stringSize by: stringSize // 5 do: [:pos |
> in:
> ByteString>>stringHash: aString initialHash: speciesHash
>
> Then we need to rehash the whole image.
>
> What do you think ? Is there a hash expert around that can confirm this
> is a good idea ?