Byte & String collection hash performance; a modest proposal for change.


Eliot Miranda-2
 
Hi All,

    the hash algorithm used for ByteString in Squeak and Pharo is good for "small" strings but overkill for large ones.  It is important in many applications to get well-distributed string hashes, especially over the range of strings that constitute things like method names, URLs, etc.  Consequently, the current algorithm includes every character in a string.  This works very well for "small" strings but results in very slow hashes (and hence long latencies, because the hash is an uninterruptible primitive) for large strings, where large may be several megabytes.

Let's look at the basic hash algorithm.  The following method is translated by the compiler machinery in VMMaker from Smalltalk to C.  It creates a primitive function called primitiveStringHash, so when invoked in normal Smalltalk code the method below actually runs its C translation; neat.

ByteArray>>hashBytes: aByteArray startingWith: speciesHash
	"Answer the hash of a byte-indexed collection, using speciesHash as the initial value.
	 See SmallInteger>>hashMultiply.

	 The primitive should be renamed at a suitable point in the future"
	<primitive: 'primitiveStringHash' module: 'MiscPrimitivePlugin'>
	| byteArraySize hash |
	<var: 'aByteArray' type: #'unsigned char *'>
	<var: 'speciesHash' type: #int>

	byteArraySize := aByteArray size.
	hash := speciesHash bitAnd: 16rFFFFFFF.
	1 to: byteArraySize do:
		[:pos |
		 hash := hash + (aByteArray basicAt: pos).
		 "Inlined hashMultiply, written this way for translation to C."
		 hash := hash * 1664525 bitAnd: 16r0FFFFFFF].
	^hash

This function is invoked via a rather convoluted chain:

String>>hash
	"#hash is implemented, because #= is implemented"
	"ar 4/10/2005: I had to change this to use ByteString hash as initial
	 hash in order to avoid having to rehash everything and yet compute
	 the same hash for ByteString and WideString.
	 md 16/10/2006: use identityHash as initialHash, as behavior hash will
	 use String hash (name) to have a better hash soon.
	 eem 4/17/2017 it's not possible to use String hash (name) for the
	 initial hash because that would be recursive."
	^self class stringHash: self initialHash: ByteString identityHash

ByteString class>>stringHash: aString initialHash: speciesHash
	"Answer the hash of a byte-indexed string, using speciesHash as the initial value.
	 See SmallInteger>>hashMultiply."
	<primitive: 'primitiveStringHash' module: 'MiscPrimitivePlugin'>
	| hash |
	hash := speciesHash bitAnd: 16rFFFFFFF.
	1 to: aString size do:
		[:pos |
		 hash := (hash + (aString basicAt: pos)) hashMultiply].
	^hash

and the generic string implementation is

String class>>stringHash: aString initialHash: speciesHash
	"Answer the hash of a byte-indexed string, using speciesHash as the initial value.
	 See SmallInteger>>hashMultiply."
	| hash |
	hash := speciesHash bitAnd: 16rFFFFFFF.
	1 to: aString size do:
		[:pos |
		 hash := (hash + (aString basicAt: pos)) hashMultiply].
	^hash

(it simply omits the primitive declaration).
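
As the ar 4/10/2005 comment above notes, the point of sharing the seed and the loop across the two implementations is that equal strings answer equal hashes regardless of their concrete class.  A quick sanity check (assuming #asWideString is present, as in current Squeak/Pharo images):

	'identityHash' hash = 'identityHash' asWideString hash  "true: both run the same loop with the same ByteString identityHash seed"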


Until yesterday the inner loop was written differently, by Andres Valoud, to avoid overflow:

hash := hash + (aByteArray basicAt: pos).
"Begin hashMultiply"
low := hash bitAnd: 16383.
hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF.

The problem here is that the Smalltalk-to-C translation machinery is naive and entirely incapable of transforming 

low := hash bitAnd: 16383.
hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF.

into
hash := hash * 1664525 bitAnd: 16r0FFFFFFF
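
For the record, the two forms are arithmetically equivalent, because 1664525 = 16r19660D = (16r0065 * 16384) + 16r260D and everything is reduced modulo 2^28.  A quick doit to convince yourself (only standard Random and assert: are used):

	| rng |
	rng := Random new.
	1 to: 1000 do:
		[:n | | hash low |
		 "pick a value already in the 28-bit range, as in the loop above"
		 hash := (rng next * 16r0FFFFFFF) truncated.
		 low := hash bitAnd: 16383.
		 self assert: ((16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF)
					= (hash * 1664525 bitAnd: 16r0FFFFFFF)]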

The reformulation makes the primitive a little quicker, a gain for larger strings, but it still suffers from the high invocation overhead described in the Cog Primitive Performance thread.

In looking at this I've added a primitive for hashMultiply; primitive #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for SmallInteger and LargePositiveInteger receivers, as fast as possible in the Cog JIT.  With this machinery in place it's instructive to compare the cost of the primitive against the non-primitive Smalltalk code.
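
With that primitive in place, hashMultiply itself can be expressed as something like the following (a sketch only: the primitive number is as above, and the fallback body is the decomposed arithmetic already shown):

SmallInteger>>hashMultiply
	"Multiply by 1664525, modulo 2^28.  Primitive 159 does this directly in the JIT;
	 the fallback decomposes the multiplication to stay within SmallInteger range."
	<primitive: 159>
	| low |
	low := self bitAnd: 16383.
	^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF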

First let me introduce a set of replacement hash functions, newHashN.  These hash all the characters of strings up to a certain size, and no more than that number of characters for larger strings.  Here are newHash64 and newHash2048, which use pure Smalltalk, including an inlined hashMultiply written to avoid SmallInteger overflow.  Also measured are the obvious variants newHash128, newHash256, newHash512 & newHash1024.

String>>newHash64
	"#hash is implemented, because #= is implemented"
	"choice of primes:
	 (HashedCollection goodPrimes select: [:n| n bitCount = (n highBit // 2) and: [n <= 16rFFFFFFF]]) collect: [:ea| {ea. ea hex}]"
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 32 max: 1) do: "At most 63 characters"
		[:i| | low |
		 hash := hash + (self basicAt: i).
		 "hash multiply"
		 low := hash bitAnd: 16383.
		 hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF].
	^hash

String>>newHash2048
	"#hash is implemented, because #= is implemented"
	"choice of primes:
	 (HashedCollection goodPrimes select: [:n| n bitCount = (n highBit // 2) and: [n <= 16rFFFFFFF]]) collect: [:ea| {ea. ea hex}]"
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 1024 max: 1) do: "At most 2047 characters"
		[:i| | low |
		 hash := hash + (self basicAt: i).
		 "hash multiply"
		 low := hash bitAnd: 16383.
		 hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF].
	^hash

So the idea here is to step through the string by 1 for string sizes up to N - 1, and by more than 1 for strings of size >= N, limiting the number of characters sampled to between N // 2 and N - 1.  A second idea is to implement the methods on String, so they are invoked directly.  A third is to discard the speciesHash and use a better value for the null string hash: a prime whose bitCount is about half its highBit (i.e. about half of its bits are set).
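
For example, the null-string value 16rCCCCCC5 (214748357) used above satisfies that criterion:

	214748357 highBit.   "28"
	214748357 bitCount.  "14, i.e. exactly half of its 28 bits are set"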

We can rewrite these more cleanly to use the hashMultiply primitive, so here are newHashP64 through newHashP2048:

String>>newHashP64
	"#hash is implemented, because #= is implemented"
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 32 max: 1) do: "At most 63 characters"
		[:i| hash := (hash + (self basicAt: i)) hashMultiply].
	^hash

String>>newHashP2048
	"#hash is implemented, because #= is implemented"
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 1024 max: 1) do: "At most 2047 characters"
		[:i| hash := (hash + (self basicAt: i)) hashMultiply].
	^hash

So e.g. newHash2048 and newHashP2048 sample at most 2047 and at least 1024 characters for strings whose size exceeds 1024 elements, and all of the elements for all strings with size <= 1024 elements.
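
To make that concrete, here is how many characters the newHash(P)2048 step rule actually samples for a few sizes (the expression just mirrors the 1 to: size by: (size // 1024 max: 1) loop above):

	#(1000 2047 2048 4096 100000) collect:
		[:size | | step |
		 step := size // 1024 max: 1.
		 (size - 1) // step + 1]
	"=> #(1000 2047 1024 1024 1031)"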

Let's compare both the hash spread (the number of distinct hashes produced) and the time taken to evaluate the three variants of hash function.  We have the interpreter primitive (hash implemented in terms of stringHash:initialHash:), newHash64 through newHash2048 (inlined hashMultiply in pure Smalltalk, written to avoid overflow into LargeInteger arithmetic), and newHashP64 through newHashP2048, written in pure Smalltalk but using the hashMultiply primitive (which avoids the need to decompose the multiplication to avoid overflow).

Here's the test harness.  A few things to note: first, it computes the blocks used, rather than inlining them in the method, to eliminate the cost of block dispatch from the measurements; the block dispatch isn't complex but it introduces a little noise.  Second, garbageCollectMost is used to run the scavenger before each measurement so that the GC is in the same initial state; again this reduces noise.

| strs "strings" ns "number of strings" nus "number of unique strings" ass "average string size" blocks "the blocks that invoke each hash" |
Smalltalk garbageCollect.
strs := ByteString allSubInstances select: [:s| s size <= 32].
ns := strs size. nus := strs asSet size.
ass := ((strs inject: 0 into: [:sum :s| sum + s size]) / strs size) rounded.
blocks := #('hash' 'newHash64' 'newHash128' 'newHash256' 'newHash512' 'newHash1024' 'newHash2048' 'newHashP64' 'newHashP128' 'newHashP256' 'newHashP512' 'newHashP1024' 'newHashP2048')
			collect: [:f| Compiler evaluate: '[:ea| ea ', f, ']'].
blocks do: [:ea| ea value: '']; do: [:ea| ea value: ''].
blocks collect:
	[:hashBlock| | nh |
	 Smalltalk garbageCollectMost.
	 { ns.
	   nus.
	   nh := (strs collect: hashBlock) asSet size.
	   nus - nh.
	   1.0 - (nh asFloat / nus asFloat).
	   ass.
	   [1 to: 100 do: [:i| strs do: hashBlock]] timeToRun
		- [1 to: 100 do: [:i| strs do: [:ea| ea class]]] timeToRun.
	   (hashBlock sourceString allButFirst: 10) allButLast }]
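
(For the two subsequent tables only the selection of strs changes, presumably along the lines of

	strs := ByteString allSubInstances select: [:s| s size between: 33 and: 1024].

and

	strs := ByteString allSubInstances select: [:s| s size > 1024].

respectively.)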

N Strings | N Unique | N Hashes | N Collisions | Collision Fraction | Avg String Size | Time (ms) | Hash Function
#(121162 54439 54435 4 7.347e-5 11 1926 'hash')

#(121162 54439 54435 3 5.510e-5 11 8913 'newHash64')
#(121162 54439 54435 3 5.510e-5 11 8879 'newHash128')
#(121162 54439 54435 3 5.510e-5 11 8870 'newHash256')
#(121162 54439 54435 3 5.510e-5 11 8835 'newHash512')
#(121162 54439 54435 3 5.510e-5 11 8879 'newHash1024')
#(121162 54439 54435 3 5.510e-5 11 8876 'newHash2048')

#(121162 54439 54435 3 5.510e-5 11 5658 'newHashP64')
#(121162 54439 54435 3 5.510e-5 11 5506 'newHashP128')
#(121162 54439 54435 3 5.510e-5 11 5677 'newHashP256')
#(121162 54439 54435 3 5.510e-5 11 5595 'newHashP512')
#(121162 54439 54435 3 5.510e-5 11 5645 'newHashP1024')
#(121162 54439 54435 3 5.510e-5 11 5571 'newHashP2048')

So for small strings the interpreter primitive wins on speed, considerably, but has one more collision (I suspect because the seed, ByteString identityHash, is poor).
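
(One quick way to test that suspicion would be to re-run the primitive over strs with the 16rCCCCCC5 seed used by the newHash variants and count the distinct hashes, e.g.

	(strs collect: [:s| ByteString stringHash: s initialHash: 214748357]) asSet size

and compare the result against the 54435 above.)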

Now for byte strings with sizes in the range 33 to 1024.  I'll dispense with the newHash forms; they're essentially half the speed of the newHashP forms but otherwise identical.

N Strings | N Unique | N Hashes | N Collisions | Collision Fraction | Avg String Size | Time (ms) | Hash Function
#(34044 25853 25852 1 3.8680e-5 148 1045 'hash')
#(34044 25853 25790 63 0.00243 148 2918 'newHashP64')
#(34044 25853 25847 6 0.00023 148 4929 'newHashP128')
#(34044 25853 25852 1 3.8680e-5 148 6757 'newHashP256')
#(34044 25853 25851 2 7.7360e-5 148 8959 'newHashP512')
#(34044 25853 25852 1 3.8680e-5 148 10055 'newHashP1024')
#(34044 25853 25852 1 3.8680e-5 148 10382 'newHashP2048')

So here, hashing between 256 and 511 characters gives as good a distribution of hashes as considering all of the string.  So I think this shows that the cutoff for effectiveness of string hashing is around 256 characters; at least on the strings in my image, not much is to be gained by hashing more.

So let's look at strings > 1024 in size

N Strings | N Unique | N Hashes | N Collisions | Collision Fraction | Avg String Size | Time (ms) | Hash Function
#(732 606 606 0 0.0 50741 5834 'hash')

#(732 606 605 1 0.001650 50741 60 'newHashP64')
#(732 606 605 1 0.001650 50741 106 'newHashP128')
#(732 606 605 1 0.001650 50741 199 'newHashP256')
#(732 606 605 1 0.001650 50741 416 'newHashP512')
#(732 606 606 0 0.0 50741 822 'newHashP1024')
#(732 606 606 0 0.0 50741 1875 'newHashP2048')

By this time the cost of hashing all characters overwhelms the primitive implementation and the pure Smalltalk code becomes much faster.  And the hash spread, the number of distinct hashes, is as good.

So that's the data.  My conclusions are that

- the primitive is clearly still a win, especially for small strings.  It could be written as a primitive that runs on the Smalltalk stack, and that would boost performance for small strings considerably.  But the primitive still wins against Cog code up through at least 150-byte strings.  We could run a different doit to detect the crossover in string length (a rough sketch follows these conclusions), but not today :-).

- replacing the primitive with one that behaves like newHash1024 or newHash2048 seems the best to me.  Such a primitive would hash between N and 2*N-1 characters for strings of length > N, where N would likely be 512, 1024 or 2048.  The primitive should also be written to hash 16-bit, 32-bit and 64-bit non-pointer arrays.
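
Something along these lines would do for finding that crossover (assuming the newHashP256 method above is installed; the numbers will of course vary with VM and machine):

	#(16 32 64 128 192 256 384 512) collect:
		[:n | | s |
		 s := String new: n withAll: $a.
		 { n.
		   [1 to: 100000 do: [:i| s hash]] timeToRun.
		   [1 to: 100000 do: [:i| s newHashP256]] timeToRun }]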

_,,,^..^,,,_
best, Eliot

Re: [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Andrei Chis
 


On Mon, May 1, 2017 at 9:09 AM, Martin McClure <[hidden email]> wrote:

I see no replies to this on any of the three lists it was sent to, so I
guess I'll chime in.

tl;dr:
Making a primitive for #hashMultiply, probably a good idea, in some
form, since doing it in Smalltalk is awkward.
Only hashing every N-th character of large strings, probably a very bad
idea. Performance might well get worse, the complexity does not seem
justified, and it would open a sizeable security hole.

More verbiage below for those interested.

Regards,
-Martin

On 04/18/2017 07:09 PM, Eliot Miranda wrote:
> Hi All,
>
>     the hash algorithm used for ByteString in Squeak and Pharo is good
> for "small" strings and overkill for large strings.

Why do you say it's overkill for large strings? Are there applications
with large strings that are being negatively impacted by the current
algorithm? Which ones, and impacted how?

> It is important in many applications to get well distributed string
> hashes, especially over the range of strings that constitute things
> like method names, URLs, etc.  Consequently, the current algorithm
> includes every character in a string.  This works very well for
> "small" strings and results in very slow hashes (and hence long
> latencies, because the hash is an uninterruptible primitive) for large
> strings, where large may be several megabytes.

A simple solution for the uninterruptible primitive is to not make it a
primitive. Make #hashMultiply a primitive (since this particular kind of
numeric modulo computation is really painful in Smalltalk), and do the
rest in a loop in Smalltalk. It sounds like you've done the
#hashMultiply primitive already.

If the overhead of calling a primitive for each character proves to be
too much, even with the faster primitive calling methodologies you
talked about in the "Cog Primitive Performance" thread on the Vm-dev
list, a more complex primitive could take a range of bytes, so large
strings would be done in batches, solving the latency problem.
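
(For illustration, the image-side loop for such a batched scheme might look something like this, where hashFrom:to:startingWith: is a hypothetical ranged version of the existing primitive:)

String>>batchedHash
	"Sketch only: hash in chunks of 4096 characters so no single primitive call runs
	 for too long.  hashFrom:to:startingWith: is the hypothetical ranged primitive;
	 each chunk continues from the hash of the previous one."
	| hash index stop |
	hash := ByteString identityHash bitAnd: 16rFFFFFFF.
	index := 1.
	[index <= self size] whileTrue:
		[stop := index + 4095 min: self size.
		 hash := self hashFrom: index to: stop startingWith: hash.
		 index := stop + 1].
	^hash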

>
> Let's look at the basic hash algorithm.
[...]
>
> In looking at this I've added a primitive for hashMultiply; primitive
> #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for
> SmallInteger and LargePositiveInteger receivers, as fast as possible
> in the Cog JIT.  With this machinery in place it's instructive to
> compare the cost of the primitive against the non-primitive Smalltalk
> code.
>
> First let me introduce a set of replacement hash functions, newHashN.
> These hash all characters in strings up to a certain size, and then no
> more than that number for larger strings.  Here are newHash64 and
> newHash2048, which use pure Smalltalk, including an inlined
> hashMultiply written to avoid SmallInteger overflow.  Also measured
> are the obvious variants newHash128, newHash256, newHash512 & mewHash1024.
>
>
[...]
> So the idea here is to step through the string by 1 for strings sizes
> up to N - 1, and by greater than 1 for strings of size >= N, limiting
> the maximum number of characters sampled to between N // 2 and N - 1.

The history of computing is littered with the bones of those who have
tried this kind of thing. It doesn't end well. Yes, you get a faster
hash function. And then you find, for sets of data that you or your
users actually want to use, that you get collisions like crazy, and much
worse overall performance than you started with.

Sure, it works OK for the sets of data that the designer *tested*, and
probably for the sets of data that they *anticipated*. But real-world
data is tricky. It includes data sets where the characters that differ
are the ones that the hash thinks are unimportant, and there goes your
performance, by orders of magnitude. For instance, early versions of
Java used a limited number of characters to hash strings. One of the
biggest compatibility-breaking changes they were forced to make in later
Java versions was to consider *all* characters in hashing. It turned out
that it was very common to hash URLs, and many distinct URLs had most of
their characters in common.

I was curious about this and found the actual issues [1], if anybody else is interested.


Cheers,
Andrei

 

And you don't always get to choose your data -- sometimes you have an
opponent who is actively looking to create collisions as a
denial-of-service attack. There was a fair-sized kerfuffle about this a
few years ago -- most web scripting languages made it too easy to mount
this kind of attack.
https://arstechnica.com/business/2011/12/huge-portions-of-web-vulnerable-to-hashing-denial-of-service-attack/
"...an attacker can degenerate the hash table by sending lots of
colliding keys. ...making it possible to exhaust hours of CPU time using
a single HTTP request."
To guard against this kind of attack you need a randomized element in
your hash (not a bad idea for Smalltalk, actually, and pretty easy --
mixing in the identity hash of the collection might be sufficient) or a
cryptographic hash (not worth the computational expense for most
purposes). However, even adding a randomized element would not prevent
this kind of attack if you predictably completely ignore some characters
of the input string. That just makes it *so* easy to generate data that
validates, and is not equal, but causes collisions.
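
(A sketch of that cheap randomization, with an illustrative selector only; the collection perturbs each element hash with its own identityHash before probing:)

HashedCollection>>saltedHashOf: anObject
	"Illustrative only: mix this collection's identityHash into the element hash so
	 that hash-flooding keys cannot be precomputed against a known hash function."
	^(anObject hash bitXor: self identityHash) hashMultiply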

So really, for general-purpose use (i.e. what's built into the language)
hash *every* character of *all* strings. If someone finds that this is a
performance problem in a real-world situation, it can be addressed in an
application-specific way.



Re: [squeak-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

David T. Lewis
In reply to this post by Eliot Miranda-2
 
Does it need to be done in the VM? Why not make a class LargeString with
instance variables aString and myCalculatedHashValueForTheString? That way
you can cache the hash value and calculate it any way you want.

I know very little about hashing, just wondering if this kind of thing can
be handled more easily in the image.

Dave


>
> On Mon, 1 May 2017, Levente Uzonyi wrote:
>
>> Well, I had started to write a reply, but I had to postpone it.
>> I mostly agree with your suggestions.
>>
>> One thing that can be done about large strings is to cache the
>> calculated
>> hash value in the larger strings. Currently the object representation
>> changes when the string contains 255 or more characters. In that case an
>> additional 64 bit field is added to the object header to store its
>> length.
>> If we were to use the upper 28+1 bits of that field to cache the hash,
>> there would still be 35-bits to encode length, which would be enough to
>> represent strings up to 8 GiB.
>
> Well, we can keep the whole range minus one bit, but then we can't store
> the hash for strings larger than 8 GiB.
>
> Levente
>
>> But this would require further VM changes (e.g. at:put: would have to
>> flush the cache).
>>
>> Levente
>>
>> On Mon, 1 May 2017, Martin McClure wrote:
>>> [...]