Byte & String collection hash performance; a modest proposal for change.


Eliot Miranda
Hi All,

    the hash algorithm used for ByteString in Squeak and Pharo is good for "small" strings but overkill for large ones.  It is important in many applications to get well-distributed string hashes, especially over the range of strings that constitute things like method names, URLs, etc.  Consequently, the current algorithm includes every character in a string.  This works very well for "small" strings but results in very slow hashes (and hence long latencies, because the hash is an uninterruptible primitive) for large strings, where large may mean several megabytes.
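
To see the latency directly, one can time the hash of a multi-megabyte string; a throwaway doit along these lines (the size is picked arbitrarily):

    [(String new: 10000000 withAll: $a) hash] timeToRun

answers a millisecond count that grows linearly with the string's size, all of it spent inside a single uninterruptible primitive call.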

Let's look at the basic hash algorithm.  The following method is translated by the compiler machinery in VMMaker from Smalltalk to C.  It creates a primitive function called primitiveStringHash, and so when invoked in normal Smalltalk code the method below invokes its C translation; neat.

ByteArray>>hashBytes: aByteArray startingWith: speciesHash
    "Answer the hash of a byte-indexed collection, using speciesHash as the initial value.
     See SmallInteger>>hashMultiply.

     The primitive should be renamed at a suitable point in the future."
    <primitive: 'primitiveStringHash' module: 'MiscPrimitivePlugin'>
    | byteArraySize hash |
    <var: 'aByteArray' type: #'unsigned char *'>
    <var: 'speciesHash' type: #int>

    byteArraySize := aByteArray size.
    hash := speciesHash bitAnd: 16rFFFFFFF.
    1 to: byteArraySize do:
        [:pos |
         hash := hash + (aByteArray basicAt: pos).
         "Inlined hashMultiply, written this way for translation to C."
         hash := hash * 1664525 bitAnd: 16r0FFFFFFF].
    ^hash

This function is invoked via a rather convoluted chain:

String>>hash
    "#hash is implemented, because #= is implemented"
    "ar 4/10/2005: I had to change this to use ByteString hash as initial
     hash in order to avoid having to rehash everything and yet compute
     the same hash for ByteString and WideString.
     md 16/10/2006: use identityHash as initialHash, as behavior hash will
     use String hash (name) to have a better hash soon.
     eem 4/17/2017 it's not possible to use String hash (name) for the
     initial hash because that would be recursive."
    ^self class stringHash: self initialHash: ByteString identityHash

ByteString class>>stringHash: aString initialHash: speciesHash
    "Answer the hash of a byte-indexed string, using speciesHash as the initial value.
     See SmallInteger>>hashMultiply."
    <primitive: 'primitiveStringHash' module: 'MiscPrimitivePlugin'>
    | hash |
    hash := speciesHash bitAnd: 16rFFFFFFF.
    1 to: aString size do:
        [:pos |
         hash := (hash + (aString basicAt: pos)) hashMultiply].
    ^hash

and the generic string implementation is

String class>>stringHash: aString initialHash: speciesHash
    "Answer the hash of a byte-indexed string, using speciesHash as the initial value.
     See SmallInteger>>hashMultiply."
    | hash |
    hash := speciesHash bitAnd: 16rFFFFFFF.
    1 to: aString size do:
        [:pos |
         hash := (hash + (aString basicAt: pos)) hashMultiply].
    ^hash

(it simply omits the primitive declaration).


As of yesterday the inner loop was written differently by Andres Valoud to avoid overflow:

    hash := hash + (aByteArray basicAt: pos).
    "Begin hashMultiply"
    low := hash bitAnd: 16383.
    hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF.

The problem here is that the Smalltalk-to-C translation machinery is naive and entirely incapable of transforming 

    low := hash bitAnd: 16383.
    hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF.

into

    hash := hash * 1664525 bitAnd: 16r0FFFFFFF

The reformulation makes the primitive a little quicker, gaining for larger strings, but still suffers the high invocation overhead as described in the Cog Primitive Performance thread.
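
For the record, the two formulations agree because 1664525 = 16r19660D = (16r0065 * 16384) + 16r260D; the decomposed form is just the 28-bit truncated multiplication carried out in 14-bit halves so that no intermediate result leaves SmallInteger range.  A doit to spot-check the equivalence, using nothing beyond the standard Integer and Random protocol:

    | r |
    r := Random new.
    1000 timesRepeat:
        [| hash low decomposed direct |
         hash := (r next * 16r10000000) truncated.  "a random 28-bit value"
         low := hash bitAnd: 16383.
         decomposed := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF.
         direct := hash * 1664525 bitAnd: 16r0FFFFFFF.
         decomposed = direct ifFalse: [self error: 'the formulations disagree']]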

In looking at this I've added a primitive for hashMultiply; primitive #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for SmallInteger and LargePositiveInteger receivers, as fast as possible in the Cog JIT.  With this machinery in place it's instructive to compare the cost of the primitive against the non-primitive Smalltalk code.
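
For reference, the method can be sketched as below; the fallback body, run only when the primitive is unavailable, is the same overflow-avoiding decomposition shown above (the exact fallback in any given image may differ):

SmallInteger>>hashMultiply
    "Answer self * 1664525 bitAnd: 16r0FFFFFFF.  Primitive 159 computes this
     directly in the JIT; the Smalltalk fallback decomposes the multiplication
     to stay within SmallInteger range."
    <primitive: 159>
    | low |
    low := self bitAnd: 16383.
    ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF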

First let me introduce a set of replacement hash functions, newHashN.  These hash all the characters of strings up to a certain size, and no more than that number of characters for larger strings.  Here are newHash64 and newHash2048, which use pure Smalltalk, including an inlined hashMultiply written to avoid SmallInteger overflow.  Also measured are the obvious variants newHash128, newHash256, newHash512 & newHash1024.

String>>newHash64
    "#hash is implemented, because #= is implemented"
    "choice of primes:
     (HashedCollection goodPrimes select: [:n| n bitCount = (n highBit // 2) and: [n <= 16rFFFFFFF]]) collect: [:ea| {ea. ea hex}]"
    | size hash |
    size := self size.
    size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
    hash := size < 262144
                ifTrue: [size * 2617 "16rA39"]
                ifFalse: [size + (size >> 16)].
    1 to: size by: (size // 32 max: 1) do: "At most 63 characters"
        [:i| | low |
         hash := hash + (self basicAt: i).
         "hash multiply"
         low := hash bitAnd: 16383.
         hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF].
    ^hash

String>>newHash2048
    "#hash is implemented, because #= is implemented"
    "choice of primes:
     (HashedCollection goodPrimes select: [:n| n bitCount = (n highBit // 2) and: [n <= 16rFFFFFFF]]) collect: [:ea| {ea. ea hex}]"
    | size hash |
    size := self size.
    size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
    hash := size < 262144
                ifTrue: [size * 2617 "16rA39"]
                ifFalse: [size + (size >> 16)].
    1 to: size by: (size // 1024 max: 1) do: "At most 2047 characters"
        [:i| | low |
         hash := hash + (self basicAt: i).
         "hash multiply"
         low := hash bitAnd: 16383.
         hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF].
    ^hash

So the idea here is to step through the string by 1 for string sizes up to N - 1, and by more than 1 for strings of size >= N, limiting the number of characters sampled from larger strings to between N // 2 and N - 1.  Another idea is to implement the methods on String, so they are invoked directly.  A third is to discard the speciesHash and use a better value for the empty string's hash: a prime about half of whose bits are set (i.e. whose bitCount is about half its highBit).
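
As a concrete example of the stepping, with newHash64 a 100-character string steps by 100 // 32 = 3 and so samples

    (1 to: 100 by: 3) size  "34 characters"

while with newHash2048 a 100,000-character string steps by 100000 // 1024 = 97 and samples 1031 of its characters.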

We can rewrite these more cleanly to use the hashMultiply primitive, so here are newHashP64 through newHashP2048:

String>>newHashP64
    "#hash is implemented, because #= is implemented"
    | size hash |
    size := self size.
    size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
    hash := size < 262144
                ifTrue: [size * 2617 "16rA39"]
                ifFalse: [size + (size >> 16)].
    1 to: size by: (size // 32 max: 1) do: "At most 63 characters"
        [:i| hash := (hash + (self basicAt: i)) hashMultiply].
    ^hash

String>>newHashP2048
    "#hash is implemented, because #= is implemented"
    | size hash |
    size := self size.
    size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
    hash := size < 262144
                ifTrue: [size * 2617 "16rA39"]
                ifFalse: [size + (size >> 16)].
    1 to: size by: (size // 1024 max: 1) do: "At most 2047 characters"
        [:i| hash := (hash + (self basicAt: i)) hashMultiply].
    ^hash

So e.g. newHash2048 and newHashP2048 sample at most 2047 and at least 1024 characters for strings whose size exceeds 1024 elements, and all of the elements for all strings with size <= 1024 elements.

Let's compare both the hash spread (the number of distinct hashes produced) and the time taken to evaluate the three variants of hash function.  We have the interpreter primitive (hash implemented in terms of stringHash:initialHash:), newHash64 through newHash2048 (inlined hashMultiply in pure Smalltalk, written to avoid overflow into LargeInteger arithmetic), and newHashP64 through newHashP2048, written in pure Smalltalk but using the hashMultiply primitive (which avoids the need to decompose the multiplication to avoid overflow).

Here's the test harness.  A few things: first, it computes the blocks used rather than inlining them in the method, to eliminate the cost of block dispatch from the measurements; the block dispatch isn't complex but introduces a little noise.  Second, garbageCollectMost is used to run the scavenger before each measurement so that the GC is in the same initial state; again this reduces noise.

| strs "strings" ns "number of strings" nus "number of unique strings" ass "average string size" blocks "the blocks that invoke each hash" |
Smalltalk garbageCollect.
strs := ByteString allSubInstances select: [:s| s size <= 32].
ns := strs size. nus := strs asSet size.
ass := ((strs inject: 0 into: [:sum :s| sum + s size]) / strs size) rounded.
blocks := #('hash' 'newHash64' 'newHash128' 'newHash256' 'newHash512' 'newHash1024' 'newHash2048' 'newHashP64' 'newHashP128' 'newHashP256' 'newHashP512' 'newHashP1024' 'newHashP2048' ) collect: [:f| Compiler evaluate: '[:ea| ea ', f, ']'].
blocks do: [:ea| ea value: ''];  do: [:ea| ea value: ''].
blocks collect:
    [:hashBlock| | nh |
     Smalltalk garbageCollectMost.
     { ns. nus. nh := (strs collect: hashBlock) asSet size. nus - nh. 1.0 - (nh asFloat / nus asFloat). ass. [1 to: 100 do: [:i| strs do: hashBlock]] timeToRun - [1 to: 100 do: [:i| strs do: [:ea| ea class]]] timeToRun. (hashBlock sourceString allButFirst: 10) allButLast }]

N Strings | N Unique | N Hashes | N Collisions | Fraction of Collisions | Avg String Size | Time (ms) | Hash Function
#(121162 54439 54435 4 7.347e-5 11 1926 'hash')

#(121162 54439 54435 3 5.510e-5 11 8913 'newHash64')
#(121162 54439 54435 3 5.510e-5 11 8879 'newHash128')
#(121162 54439 54435 3 5.510e-5 11 8870 'newHash256')
#(121162 54439 54435 3 5.510e-5 11 8835 'newHash512')
#(121162 54439 54435 3 5.510e-5 11 8879 'newHash1024')
#(121162 54439 54435 3 5.510e-5 11 8876 'newHash2048')

#(121162 54439 54435 3 5.510e-5 11 5658 'newHashP64')
#(121162 54439 54435 3 5.510e-5 11 5506 'newHashP128')
#(121162 54439 54435 3 5.510e-5 11 5677 'newHashP256')
#(121162 54439 54435 3 5.510e-5 11 5595 'newHashP512')
#(121162 54439 54435 3 5.510e-5 11 5645 'newHashP1024')
#(121162 54439 54435 3 5.510e-5 11 5571 'newHashP2048')

So for small strings the interpreter primitive wins on speed, considerably, but has one more collision (I suspect because the seed, ByteString identityHash, is poor).

Now for byte strings with sizes in the range 33 to 1024.  I'll dispense with the newHash forms; they're essentially half the speed of the newHashP forms but otherwise identical.

N Strings | N Unique | N Hashes | N Collisions | Fraction of Collisions | Avg String Size | Time (ms) | Hash Function
#(34044 25853 25852 1 3.8680e-5 148 1045 'hash')
#(34044 25853 25790 63 0.00243 148 2918 'newHashP64')
#(34044 25853 25847 6 0.00023 148 4929 'newHashP128')
#(34044 25853 25852 1 3.8680e-5 148 6757 'newHashP256')
#(34044 25853 25851 2 7.7360e-5 148 8959 'newHashP512')
#(34044 25853 25852 1 3.8680e-5 148 10055 'newHashP1024')
#(34044 25853 25852 1 3.8680e-5 148 10382 'newHashP2048')

Here, hashing between 256 and 511 characters gives as good a distribution of hashes as considering the whole string, so I think the cut-off for effectiveness of string hashing is around 256 characters.  At least on the strings in my image, not much is gained by hashing more.

So let's look at strings > 1024 in size

N Strings | N Unique | N Hashes | N Collisions | Fraction of Collisions | Avg String Size | Time (ms) | Hash Function
#(732 606 606 0 0.0 50741 5834 'hash')

#(732 606 605 1 0.001650 50741 60 'newHashP64')
#(732 606 605 1 0.001650 50741 106 'newHashP128')
#(732 606 605 1 0.001650 50741 199 'newHashP256')
#(732 606 605 1 0.001650 50741 416 'newHashP512')
#(732 606 606 0 0.0 50741 822 'newHashP1024')
#(732 606 606 0 0.0 50741 1875 'newHashP2048')

By this time the cost of hashing all characters overwhelms the primitive implementation and the pure Smalltalk code becomes much faster.  And the hash spread, the number of distinct hashes, is as good.

So that's the data.  My conclusions are that

- the primitive is clearly still a win, especially for small strings.  It could be written as a primitive that runs on the Smalltalk stack, which would boost performance for small strings considerably.  But the primitive still wins against Cog code up through at least 150-byte strings.  We could run a different doit to detect the cross-over string length, but not today :-).

- replacing the primitive with one that behaves like newHash1024 or newHash2048 seems best to me; a sketch follows below.  Such a primitive would hash between N and 2*N-1 characters for strings of length > N, where N would likely be 512, 1024 or 2048.  The primitive should also be written to hash 16-bit, 32-bit and 64-bit non-pointer arrays.
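
For concreteness, here is a sketch of what the fallback code for such a primitive might look like with N = 1024, written against ByteArray; the selector and primitive name are hypothetical, and the same loop generalizes to the wider non-pointer arrays since basicAt: answers the unsigned element at each index:

ByteArray>>sampledHash
    "Hypothetical newHash2048-style hash: hash every element of arrays of
     size <= 1024, and between 1024 and 2047 elements of larger arrays."
    <primitive: 'primitiveSampledHash' module: 'MiscPrimitivePlugin'>  "illustrative name only"
    | size hash |
    size := self size.
    size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
    hash := size < 262144
                ifTrue: [size * 2617 "16rA39"]
                ifFalse: [size + (size >> 16)].
    1 to: size by: (size // 1024 max: 1) do:
        [:i| hash := (hash + (self basicAt: i)) hashMultiply].
    ^hash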

_,,,^..^,,,_
best, Eliot