Byte & String collection hash performance; a modest proposal for change.

Eliot Miranda-2
Hi All,

    the hash algorithm used for ByteString in Squeak and Pharo is good for "small" strings and overkill for large strings.  It is important in many applications to get well distributed string hashes, especially over the range of strings that constitute things like method names, URLs, etc.  Consequently, the current algorithm includes every character in a string.  This works very well for "small" strings and results in very slow hashes (and hence long latencies, because the hash is an uninterruptible primitive) for large strings, where large may be several megabytes.

Let's look at the basic hash algorithm.  The following method is translated by compiler machinery in VMMaker from Smalltalk to C.  It creates a primitive function called primitiveStringHash, and so when invoked from normal Smalltalk code the method below runs its C translation; neat.

ByteArray>>hashBytes: aByteArray startingWith: speciesHash
	"Answer the hash of a byte-indexed collection, using speciesHash as the initial value.
	 See SmallInteger>>hashMultiply.

	 The primitive should be renamed at a suitable point in the future"
	<primitive: 'primitiveStringHash' module: 'MiscPrimitivePlugin'>
	| byteArraySize hash |
	<var: 'aByteArray' type: #'unsigned char *'>
	<var: 'speciesHash' type: #int>

	byteArraySize := aByteArray size.
	hash := speciesHash bitAnd: 16rFFFFFFF.
	1 to: byteArraySize do:
		[:pos |
		 hash := hash + (aByteArray basicAt: pos).
		 "Inlined hashMultiply, written this way for translation to C."
		 hash := hash * 1664525 bitAnd: 16r0FFFFFFF].
	^hash

This function is invoked via a rather convoluted chain:

String>>hash
	"#hash is implemented, because #= is implemented"
	"ar 4/10/2005: I had to change this to use ByteString hash as initial
	 hash in order to avoid having to rehash everything and yet compute
	 the same hash for ByteString and WideString.
	 md 16/10/2006: use identityHash as initialHash, as behavior hash will
	 use String hash (name) to have a better hash soon.
	 eem 4/17/2017 it's not possible to use String hash (name) for the
	 initial hash because that would be recursive."
	^self class stringHash: self initialHash: ByteString identityHash

ByteString class>>stringHash: aString initialHash: speciesHash
	"Answer the hash of a byte-indexed string, using speciesHash as the initial value.
	 See SmallInteger>>hashMultiply."
	<primitive: 'primitiveStringHash' module: 'MiscPrimitivePlugin'>
	| hash |
	hash := speciesHash bitAnd: 16rFFFFFFF.
	1 to: aString size do:
		[:pos |
		 hash := (hash + (aString basicAt: pos)) hashMultiply].
	^hash

and the generic string implementation is

String class>>stringHash: aString initialHash: speciesHash
	"Answer the hash of a byte-indexed string, using speciesHash as the initial value.
	 See SmallInteger>>hashMultiply."
	| hash |
	hash := speciesHash bitAnd: 16rFFFFFFF.
	1 to: aString size do:
		[:pos |
		 hash := (hash + (aString basicAt: pos)) hashMultiply].
	^hash

(it simply omits the primitive declaration).


As of yesterday the inner loop was written differently by Andres Valloud to avoid overflow:

hash := hash + (aByteArray basicAt: pos).
"Begin hashMultiply"
low := hash bitAnd: 16383.
hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF.

The problem here is that the Smalltalk-to-C translation machinery is naive and entirely incapable of transforming 

low := hash bitAnd: 16383.
hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF.

into
hash := hash * 1664525 bitAnd: 16r0FFFFFFF
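
The decomposed form computes exactly the same value as the plain multiply-and-mask for 28-bit inputs, which a quick workspace doit can confirm (just a sketch stepping through a sample of values; it should answer true):

(0 to: 16r0FFFFFFF by: 16r12345) allSatisfy:
	[:hash | | low |
	 low := hash bitAnd: 16383.
	 ((16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF)
		= (hash * 1664525 bitAnd: 16r0FFFFFFF)]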

The reformulation makes the primitive a little quicker, gaining for larger strings, but it still suffers from the high invocation overhead described in the Cog Primitive Performance thread.

In looking at this I've added a primitive for hashMultiply; primitive #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for SmallInteger and LargePositiveInteger receivers, as fast as possible in the Cog JIT.  With this machinery in place it's instructive to compare the cost of the primitive against the non-primitive Smalltalk code.
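
For reference, the image-side method can then be as simple as the following sketch (the exact class and fallback may differ; the fallback shown is the decomposed form above, which stays within SmallInteger range):

SmallInteger>>hashMultiply
	"Answer self * 1664525 bitAnd: 16r0FFFFFFF.  Primitive 159 computes this
	 directly; the fallback code decomposes the multiplication to avoid
	 overflow."
	<primitive: 159>
	| low |
	low := self bitAnd: 16383.
	^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF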

First let me introduce a set of replacement hash functions, newHashN.  These hash all characters in strings up to a certain size, and then no more than that number for larger strings.  Here are newHash64 and newHash2048, which use pure Smalltalk, including an inlined hashMultiply written to avoid SmallInteger overflow.  Also measured are the obvious variants newHash128, newHash256, newHash512 & newHash1024.

String>>newHash64
	"#hash is implemented, because #= is implemented"
	"choice of primes:
	 (HashedCollection goodPrimes select: [:n| n bitCount = (n highBit // 2) and: [n <= 16rFFFFFFF]]) collect: [:ea| {ea. ea hex}]"
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 32 max: 1) do: "At most 63 characters"
		[:i| | low |
		 hash := hash + (self basicAt: i).
		 "hash multiply"
		 low := hash bitAnd: 16383.
		 hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF].
	^hash

String>>newHash2048
	"#hash is implemented, because #= is implemented"
	"choice of primes:
	 (HashedCollection goodPrimes select: [:n| n bitCount = (n highBit // 2) and: [n <= 16rFFFFFFF]]) collect: [:ea| {ea. ea hex}]"
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 1024 max: 1) do: "At most 2047 characters"
		[:i| | low |
		 hash := hash + (self basicAt: i).
		 "hash multiply"
		 low := hash bitAnd: 16383.
		 hash := (16r260D * low + ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384)) bitAnd: 16r0FFFFFFF].
	^hash

So the idea here is to step through the string by 1 for string sizes up to N - 1, and by greater than 1 for strings of size >= N, limiting the number of characters sampled to between N // 2 and N - 1.  Another idea is to implement the methods on String, so they are invoked directly.  A third is to discard the speciesHash and use a better value for the null string hash, a prime whose bitCount is about half its highBit (i.e. about half of its bits are set).
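
For example, with newHash64 (N = 64) the stride and sample count work out like this:

	100 // 32 max: 1.	"3 -- a 100-character string samples characters 1, 4, ..., 100, i.e. 34 of them"
	63 // 32 max: 1.	"1 -- strings shorter than 64 characters hash every character"
	10000 // 32 max: 1.	"312 -- a 10000-character string samples only 33 characters"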

We can rewrite these more cleanly to use the hashMultiply primitive, so here are newHashP64 through newHashP2048:

String>>newHashP64
	"#hash is implemented, because #= is implemented"
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 32 max: 1) do: "At most 63 characters"
		[:i| hash := (hash + (self basicAt: i)) hashMultiply].
	^hash

String>>newHashP2048
	"#hash is implemented, because #= is implemented"
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 1024 max: 1) do: "At most 2047 characters"
		[:i| hash := (hash + (self basicAt: i)) hashMultiply].
	^hash

So e.g. newHash2048 and newHashP2048 sample at most 2047 and at least 1024 characters for strings whose size exceeds 1024 elements, and all of the elements for all strings with size <= 1024 elements.

Let's compare both the hash spread (the number of distinct hashes produced) and the time taken to evaluate the three variants of hash function.  We have the interpreter primitive (hash implemented in terms of stringHash:initialHash:), newHash64 through newHash2048 (inlined hashMultiply in pure Smalltalk, written to avoid overflow into LargeInteger arithmetic), and newHashP64 through newHashP2048, written in pure Smalltalk but using the hashMultiply primitive (which avoids the need to decompose the multiplication to avoid overflow).

Here's the test harness.  A few things: first, it computes the blocks used rather than inlining them in the method, to eliminate the cost of block dispatch from the measurements.  The block dispatch isn't complex but introduces a little noise.  Second, garbageCollectMost is used to run the scavenger before each measurement so that GC is in the same initial state; again this reduces noise.

| strs "strings" ns "number of strings" nus "number of unique strings" ass "average string size" blocks "the blocks that invoke each hash" |
Smalltalk garbageCollect.
strs := ByteString allSubInstances select: [:s| s size <= 32].
ns := strs size. nus := strs asSet size.
ass := ((strs inject: 0 into: [:sum :s| sum + s size]) / strs size) rounded.
blocks := #('hash' 'newHash64' 'newHash128' 'newHash256' 'newHash512' 'newHash1024' 'newHash2048' 'newHashP64' 'newHashP128' 'newHashP256' 'newHashP512' 'newHashP1024' 'newHashP2048' ) collect: [:f| Compiler evaluate: '[:ea| ea ', f, ']'].
blocks do: [:ea| ea value: ''];  do: [:ea| ea value: ''].
blocks collect:
	[:hashBlock| | nh |
	 Smalltalk garbageCollectMost.
	 { ns.
	   nus.
	   nh := (strs collect: hashBlock) asSet size.
	   nus - nh.
	   1.0 - (nh asFloat / nus asFloat).
	   ass.
	   [1 to: 100 do: [:i| strs do: hashBlock]] timeToRun
		- [1 to: 100 do: [:i| strs do: [:ea| ea class]]] timeToRun.
	   (hashBlock sourceString allButFirst: 10) allButLast }]

N Strings | N Unique | N Hashes | N Collisions | Fraction of Collisions | Avg String Size | Time (ms) | Hash Function
#(121162 54439 54435 4 7.347e-5 11 1926 'hash')

#(121162 54439 54435 3 5.510e-5 11 8913 'newHash64')
#(121162 54439 54435 3 5.510e-5 11 8879 'newHash128')
#(121162 54439 54435 3 5.510e-5 11 8870 'newHash256')
#(121162 54439 54435 3 5.510e-5 11 8835 'newHash512')
#(121162 54439 54435 3 5.510e-5 11 8879 'newHash1024')
#(121162 54439 54435 3 5.510e-5 11 8876 'newHash2048')

#(121162 54439 54435 3 5.510e-5 11 5658 'newHashP64')
#(121162 54439 54435 3 5.510e-5 11 5506 'newHashP128')
#(121162 54439 54435 3 5.510e-5 11 5677 'newHashP256')
#(121162 54439 54435 3 5.510e-5 11 5595 'newHashP512')
#(121162 54439 54435 3 5.510e-5 11 5645 'newHashP1024')
#(121162 54439 54435 3 5.510e-5 11 5571 'newHashP2048')

So for small strings the interpreter primitive wins on speed, considerably, but has one more collision (I suspect because the seed, ByteString identityHash, is poor).

Now for byte strings with sizes in the range 33 to 1024; I'll dispense with the newHash forms; they're essentially half the speed of the newHashP forms but otherwise identical.

N Strings | N Unique | N Hashes | N Collisions | Fraction of Collisions | Avg String Size | Time (ms) | Hash Function
#(34044 25853 25852 1 3.8680e-5 148 1045 'hash')
#(34044 25853 25790 63 0.00243 148 2918 'newHashP64')
#(34044 25853 25847 6 0.00023 148 4929 'newHashP128')
#(34044 25853 25852 1 3.8680e-5 148 6757 'newHashP256')
#(34044 25853 25851 2 7.7360e-5 148 8959 'newHashP512')
#(34044 25853 25852 1 3.8680e-5 148 10055 'newHashP1024')
#(34044 25853 25852 1 3.8680e-5 148 10382 'newHashP2048')

So here, hashing between 256 and 511 characters gives as good a distribution of hashes as considering all of the string.  So I think this shows that the cut-off for effective string hashing is around 256 characters; at least on the strings in my image, not much is to be gained by hashing more.

So let's look at strings > 1024 in size

N Strings | N Unique | N Hashes | N Collisions | Fraction of Collisions | Avg String Size | Time (ms) | Hash Function
#(732 606 606 0 0.0 50741 5834 'hash')

#(732 606 605 1 0.001650 50741 60 'newHashP64')
#(732 606 605 1 0.001650 50741 106 'newHashP128')
#(732 606 605 1 0.001650 50741 199 'newHashP256')
#(732 606 605 1 0.001650 50741 416 'newHashP512')
#(732 606 606 0 0.0 50741 822 'newHashP1024')
#(732 606 606 0 0.0 50741 1875 'newHashP2048')

By this time the cost of hashing all characters overwhelms the primitive implementation and the pure Smalltalk code becomes much faster.  And the hash spread, the number of distinct hashes, is as good.

So that's the data.  My conclusions are that

- the primitive is clearly still a win, especially for small strings.  It could be written as a primitive that runs on the Smalltalk stack, and that would boost performance for small strings considerably.  But the primitive still wins against Cog code up through at least 150-byte strings.  We could run a different doit to detect the crossover string length, but not today :-).

- replacing the primitive with one that behaves like newHash1024 or newHash2048 seems the best to me.  Such a primitive would hash between N and 2*N-1 characters for strings of length > N, where N would likely be 512, 1024 or 2048.  The primitive should also be written to hash 16-bit, 32-bit and 64-bit non-pointer arrays.
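
For concreteness, the image-side fallback for such a primitive might look like the following sketch (hypothetical selector; it is essentially newHashP2048 hoisted up so that one loop serves any non-pointer indexable class, since basicAt: answers the unsigned element value whatever the element size):

ArrayedCollection>>sampledHash
	"Sketch of the proposed behaviour: hash every element of small receivers
	 and between 1024 and 2047 sampled elements of large ones.  Intended for
	 non-pointer (8-, 16-, 32- and 64-bit) indexable receivers."
	| size hash |
	size := self size.
	size = 0 ifTrue: [^214748357 "16rCCCCCC5"].
	hash := size < 262144
				ifTrue: [size * 2617 "16rA39"]
				ifFalse: [size + (size >> 16)].
	1 to: size by: (size // 1024 max: 1) do:
		[:i| hash := (hash + (self basicAt: i)) hashMultiply].
	^hash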

_,,,^..^,,,_
best, Eliot



Re: [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Martin McClure-2
I see no replies to this on any of the three lists it was sent to, so I
guess I'll chime in.

tl;dr:
Making a primitive for #hashMultiply, probably a good idea, in some
form, since doing it in Smalltalk is awkward.
Only hashing every N-th character of large strings, probably a very bad
idea. Performance might well get worse, the complexity does not seem
justified, and it would open a sizeable security hole.

More verbiage below for those interested.

Regards,
-Martin

On 04/18/2017 07:09 PM, Eliot Miranda wrote:
> Hi All,
>
>     the hash algorithm used for ByteString in Squeak and Pharo is good
> for "small" strings and overkill for large strings.

Why do you say it's overkill for large strings? Are there applications
with large strings that are being negatively impacted by the current
algorithm? Which ones, and impacted how?

> It is important in many applications to get well distributed string
> hashes, especially over the range of strings that constitute things
> like method names, URLs, etc.  Consequently, the current algorithm
> includes every character in a string.  This works very well for
> "small" strings and results in very slow hashes (and hence long
> latencies, because the hash is an uninterruptible primitive) for large
> strings, where large may be several megabytes.

A simple solution for the uninterruptible primitive is to not make it a
primitive. Make #hashMultiply a primitive (since this particular kind of
numeric modulo computation is really painful in Smalltalk), and do the
rest in a loop in Smalltalk. It sounds like you've done the
#hashMultiply primitive already.

If the overhead of calling a primitive for each character proves to be
too much, even with the faster primitive calling methodologies you
talked about in the "Cog Primitive Performance" thread on the Vm-dev
list, a more complex primitive could take a range of bytes, so large
strings would be done in batches, solving the latency problem.
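
For example, the image side might drive such a range-taking primitive along these lines (a sketch only; hashBytes:from:to:initialHash: is a hypothetical selector, not an existing primitive):

String>>hash
	"Sketch: hash in fixed-size batches so the VM can service interrupts
	 between primitive calls.  hashBytes:from:to:initialHash: is a
	 hypothetical range-taking variant of the existing primitive."
	| hash index batchSize |
	hash := ByteString identityHash bitAnd: 16r0FFFFFFF.
	batchSize := 65536.
	index := 1.
	[index <= self size] whileTrue:
		[hash := self class
					hashBytes: self
					from: index
					to: (index + batchSize - 1 min: self size)
					initialHash: hash.
		 index := index + batchSize].
	^hash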

>
> Let's look at the basic hash algorithm.
[...]

>
> In looking at this I've added a primitive for hashMultiply; primitive
> #159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for
> SmallInteger and LargePositiveInteger receivers, as fast as possible
> in the Cog JIT.  With this machinery in place it's instructive to
> compare the cost of the primitive against the non-primitive Smalltalk
> code.
>
> First let me introduce a set of replacement hash functions, newHashN.
> These hash all characters in strings up to a certain size, and then no
> more than that number for larger strings.  Here are newHash64 and
> newHash2048, which use pure Smalltalk, including an inlined
> hashMultiply written to avoid SmallInteger overflow.  Also measured
> are the obvious variants newHash128, newHash256, newHash512 & mewHash1024.
>
>
[...]
> So the idea here is to step through the string by 1 for strings sizes
> up to N - 1, and by greater than 1 for strings of size >= N, limiting
> the maximum number of characters sampled to between N // 2 and N - 1.

The history of computing is littered with the bones of those who have
tried this kind of thing. It doesn't end well. Yes, you get a faster
hash function. And then you find, for sets of data that you or your
users actually want to use, that you get collisions like crazy, and much
worse overall performance than you started with.

Sure, it works OK for the sets of data that the designer *tested*, and
probably for the sets of data that they *anticipated*. But real-world
data is tricky. It includes data sets where the characters that differ
are the ones that the hash thinks are unimportant, and there goes your
performance, by orders of magnitude. For instance, early versions of
Java used a limited number of characters to hash strings. One of the
biggest compatibility-breaking changes they were forced to make in later
Java versions was to consider *all* characters in hashing. It turned out
that it was very common to hash URLs, and many distinct URLs had most of
their characters in common.

And you don't always get to choose your data -- sometimes you have an
opponent who is actively looking to create collisions as a
denial-of-service attack. There was a fair-sized kerfuffle about this a
few years ago -- most web scripting languages made it too easy to mount
this kind of attack.
https://arstechnica.com/business/2011/12/huge-portions-of-web-vulnerable-to-hashing-denial-of-service-attack/
"...an attacker can degenerate the hash table by sending lots of
colliding keys. ...making it possible to exhaust hours of CPU time using
a single HTTP request."
To guard against this kind of attack you need a randomized element in
your hash (not a bad idea for Smalltalk, actually, and pretty easy --
mixing in the identity hash of the collection might be sufficient) or a
cryptographic hash (not worth the computational expense for most
purposes). However, even adding a randomized element would not prevent
this kind of attack if you predictably completely ignore some characters
of the input string. That just makes it *so* easy to generate data that
validates, and is not equal, but causes collisions.
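
For illustration, one way to add a per-image random element might look like the following sketch (HashSeed is a hypothetical class variable initialised from a random source at startup; mixing in the identity hash of each collection, as suggested above, would be an alternative):

String>>hash
	"Sketch: salt the initial hash with a per-image random seed so that
	 colliding keys cannot be precomputed offline.  HashSeed is hypothetical."
	^self class
		stringHash: self
		initialHash: (ByteString identityHash bitXor: HashSeed)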

So really, for general-purpose use (i.e. what's built into the language)
hash *every* character of *all* strings. If someone finds that this is a
performance problem in a real-world situation, it can be addressed in an
application-specific way.



Re: [Pharo-dev] [Vm-dev] Byte & String collection hash performance; a modest proposal for change.

Martin McClure-2
On 05/01/2017 02:24 AM, Andrei Chis wrote:
>> performance, by orders of magnitude. For instance, early versions of
>> Java used a limited number of characters to hash strings. One of the
>> biggest compatibility-breaking changes they were forced to make in later
>> Java versions was to consider *all* characters in hashing. It turned out
>> that it was very common to hash URLs, and many distinct URLs had most of
>> their characters in common.
>
> Was curious about this and found the actual issues [1] if anybody else is interested.


Thanks for digging this up, Andrei! I was working from memory, and had forgotten some of the details. Yeah, they had to change the language spec, which was a Big Deal at the time. (and Josh Bloch and Guy Steele, no less!)

Regards,

-Martin




Re: [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Levente Uzonyi
In reply to this post by Martin McClure-2
Well, I had started to write a reply, but I had to postpone it.
I mostly agree with your suggestions.

One thing that can be done about large strings is to cache the calculated
hash value in the larger strings. Currently the object representation
changes when the string contains 255 or more characters. In that case an
additional 64 bit field is added to the object header to store its length.
If we were to use the upper 28+1 bits of that field to cache the hash,
there would still be 35 bits to encode length, which would be enough to
represent strings up to 8 GiB.
But this would require further VM changes (e.g. at:put: would have to
flush the cache).
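
From the image side, using such a cache might look like the following sketch (both cachedHash and cacheHash: are hypothetical primitives that read and write the spare header bits):

ByteString>>hash
	"Sketch: answer the hash cached in the spare header bits if present
	 (cachedHash answers nil when nothing is cached); otherwise compute it
	 once and ask the VM to remember it.  Both primitives are hypothetical."
	| hash |
	(hash := self cachedHash) ifNotNil: [^hash].
	hash := self class stringHash: self initialHash: ByteString identityHash.
	self cacheHash: hash.
	^hash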

Levente



Re: [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Levente Uzonyi
On Mon, 1 May 2017, Levente Uzonyi wrote:

> Well, I had started to write a reply, but I had to postpone it.
> I mostly agree with your suggestions.
>
> One thing that can be done about large strings is to cache the calculated
> hash value in the larger strings. Currently the object representation
> changes when the string contains 255 or more characters. In that case an
> additional 64 bit field is added to the object header to store its length.
> If we were to use the upper 28+1 bits of that field to cache the hash,
> there would still be 35-bits to encode length, which would be enough to
> represent strings up to 8 GiB.

Well, we can keep the whole range minus one bit, but then we can't store
the hash for strings larger than 8 GiB.

Levente



Re: [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Levente Uzonyi
I presume that a general-purpose in-image solution would be more complex.
String already has too many subclasses (6 in Squeak), while at the same
time other kinds of new subclasses would be welcome too, e.g. Strings with
2-byte characters.
Since these properties are orthogonal, there would be many new subclasses
to cover all cases.
Storing the size of the string in a different header word is a VM-specific
detail, so I think caching the hash could also be hidden from the image.

Levente

On Mon, 1 May 2017, David T. Lewis wrote:

>
> Does it need to be done in the VM? Why not make a class LargeString with
> instance variables aString and myCalculatedHashValueForTheString. That way
> you can cache the hash value and calculate it any way you want.
>
> I know very little about hashing, just wondering if this kind of thing can
> be handled more easily in the image.
>
> Dave


Re: [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Tony Garnock-Jones-3
On 5/1/17 1:26 PM, Levente Uzonyi wrote:
> I presume that a general purpose in-image solution would be more
> complex. String already has too many subclasses (6 in Squeak), while at
> the same time other kind of new subclasses would be welcome too, e.g.
> Strings with 2-byte characters.
> Since these properties are orthogonal, there would be many new
> subclasses to cover all cases.

A classic motivating case for Traits, right?

Tony


Re: [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Eliot Miranda-2


On Mon, May 1, 2017 at 10:37 AM, Tony Garnock-Jones <[hidden email]> wrote:
On 5/1/17 1:26 PM, Levente Uzonyi wrote:
> I presume that a general purpose in-image solution would be more
> complex. String already has too many subclasses (6 in Squeak), while at
> the same time other kind of new subclasses would be welcome too, e.g.
> Strings with 2-byte characters.
> Since these properties are orthogonal, there would be many new
> subclasses to cover all cases.

A classic motivating case for Traits, right?

Alas no.  The problem with String's subclasses is that they're binary objects.  They have no inst vars in which one can cache a hash.  They're just a flat vector of bytes.  If one added a trait to them the system would fail because there's no way to add an inst var to a binary object in the current Smalltalk object representation.  Hence Levente's very clever idea of hiding the hash in a hidden header word.


Tony




--
_,,,^..^,,,_
best, Eliot



Re: [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Tony Garnock-Jones-3
On 5/1/17 2:55 PM, Eliot Miranda wrote:
> Alas no.  The problem with String's subclasses is that they're binary
> objects.  They have no inst vars in which one can cache a hash.

I see. That suggests a deeper refactoring, of course, to separate the
vector of bytes from the rest of the instance, at least in some cases.
It'd be a lot of design and implementation work.


Re: [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Levente Uzonyi
In reply to this post by Tony Garnock-Jones-3
I have a really strong opinion on Traits, which is: they are bad, because
they are stateless. I proposed a better solution, which would have allowed
you to work with classes only, but no one showed interest in it.

Levente

On Mon, 1 May 2017, Tony Garnock-Jones wrote:

> On 5/1/17 1:26 PM, Levente Uzonyi wrote:
>> I presume that a general purpose in-image solution would be more
>> complex. String already has too many subclasses (6 in Squeak), while at
>> the same time other kind of new subclasses would be welcome too, e.g.
>> Strings with 2-byte characters.
>> Since these properties are orthogonal, there would be many new
>> subclasses to cover all cases.
>
> A classic motivating case for Traits, right?
>
> Tony
>


Re: [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

David T. Lewis
In reply to this post by Levente Uzonyi
On Mon, May 01, 2017 at 07:26:35PM +0200, Levente Uzonyi wrote:

> I presume that a general purpose in-image solution would be more complex.
> String already has too many subclasses (6 in Squeak), while at the same
> time other kind of new subclasses would be welcome too, e.g. Strings with
> 2-byte characters.
> Since these properties are orthogonal, there would be many new subclasses
> to cover all cases.
> Storing the size of the string in a different header word is a VM specific
> detail, so I think caching the hash could also be hidden from the image.
>
> Levente

Actually, I meant something more like this:

   Object subclass: #LargeString
        instanceVariableNames: 'anyKindOfString cachedHashValueForTheString'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Probably a Bad Idea'


I was guessing that hashing very large strings would imply a somewhat
specialized problem domain, for which a wrapper class might make sense.
Certainly it would not be a general solution.
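
The wrapper's hash might then simply memoise the wrapped string's hash, along these lines (a sketch using the instance variable names above):

LargeString>>hash
	"Compute the underlying string's hash once and cache it."
	^cachedHashValueForTheString ifNil:
		[cachedHashValueForTheString := anyKindOfString hash]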

I am probably over my quota of bad ideas for today, so I'll stop now ;-)

Dave




Re: [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Levente Uzonyi
On Mon, 1 May 2017, David T. Lewis wrote:

> [...]

That's one way to solve the problem. It works, but you have to wrap all
your strings, or they won't be equal to non-wrapped strings. So, it's
not a transparent solution.
If you just want to use a different, expectedly quicker hash function,
you can always use PluggableSet and PluggableDictionary.
(If we had a really good JIT, using them would have no overhead compared
to Set and Dictionary.)
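
For example (a sketch, assuming the usual hashBlock:/equalBlock: accessors; newHashP1024 stands for whatever custom hash function is chosen):

	| quickSet |
	quickSet := PluggableSet new.
	quickSet hashBlock: [:string | string newHashP1024].
	quickSet equalBlock: [:a :b | a = b].
	quickSet addAll: (ByteString allSubInstances select: [:s| s size > 1024]).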

Levente

>
> Dave
>
>
>>
>> On Mon, 1 May 2017, David T. Lewis wrote:
>>
>> >
>> >Does it need to be done in the VM? Why not make a class LargeString with
>> >instance variables aString and myCalculatedHashValueForTheString. That way
>> >you can cache the hash value and calculate it any way you want.
>> >
>> >I know vey little about hashing, just wondering if this kind of thing can
>> >be handled more easily in the image.
>> >
>> >Dave
>> >
>> >
>> >>
>> >>On Mon, 1 May 2017, Levente Uzonyi wrote:
>> >>
>> >>>Well, I had started to write a reply, but I had to postpone it.
>> >>>I mostly agree with your suggestions.
>> >>>
>> >>>One thing that can be done about large strings is to cache the
>> >>>calculated
>> >>>hash value in the larger strings. Currently the object representation
>> >>>changes when the string contains 255 or more characters. In that case an
>> >>>additional 64 bit field is added to the object header to store its
>> >>>length.
>> >>>If we were to use the upper 28+1 bits of that field to cache the hash,
>> >>>there would still be 35-bits to encode length, which would be enough to
>> >>>represent strings up to 8 GiB.
>> >>
>> >>Well, we can keep the whole range minus one bit, but then we can't store
>> >>the hash for strings larger than 8 GiB.
>> >>
>> >>Levente
>> >>
>> >>>But this would require further VM changes (e.g. at:put: would have to
>> >>>flush the cache).
>> >>>
>> >>>Levente
>> >>>
>> >>>On Mon, 1 May 2017, Martin McClure wrote:
>> >>>
>> >>>>I see no replies to this on any of the three lists it was sent to, so I
>> >>>>guess I'll chime in.
>> >>>>
>> >>>>tl;dr:
>> >>>>Making a primitive for #hashMultiply, probably a good idea, in some
>> >>>>form, since doing it in Smalltalk is awkward.
>> >>>>Only hashing every N-th character of large strings, probably a very bad
>> >>>>idea. Performance might well get worse, the complexity does not seem
>> >>>>justified, and it would open a sizeable security hole.
>> >>>>
>> >>>>More verbiage below for those interested.
>> >>>>
>> >>>>Regards,
>> >>>>-Martin
>> >>>>
>> >>>>On 04/18/2017 07:09 PM, Eliot Miranda wrote:
>> >>>>>Hi All,
>> >>>>>
>> >>>>>    the hash algorithm used for ByteString in Squeak and Pharo is good
>> >>>>>for "small" strings and overkill for large strings.
>> >>>>
>> >>>>Why do you say it's overkill for large strings? Are there applications
>> >>>>with large strings that are being negatively impacted by the current
>> >>>>algorithm? Which ones, and impacted how?
>> >>>>
>> >>>>>It is important in many applications to get well distributed string
>> >>>>>hashes, especially over the range of strings that constitute things
>> >>>>>like method names, URLs, etc.  Consequently, the current algorithm
>> >>>>>includes every character in a string.  This works very well for
>> >>>>>"small" strings and results in very slow hashes (and hence long
>> >>>>>latencies, because the hash is an uninterruptible primitive) for large
>> >>>>>strings, where large may be several megabytes.
>> >>>>
>> >>>>A simple solution for the uninterruptible primitive is to not make it a
>> >>>>primitive. Make #hashMultiply a primitive (since this particular kind
>> >>>>of
>> >>>>numeric modulo computation is really painful in Smalltalk), and do the
>> >>>>rest in a loop in Smalltalk. It sounds like you've done the
>> >>>>#hashMultiply primitive already.
>> >>>>
>> >>>>If the overhead of calling a primitive for each character proves to be
>> >>>>too much, even with the faster primitive calling methodologies you
>> >>>>talked about in the "Cog Primitive Performance" thread on the Vm-dev
>> >>>>list, a more complex primitive could take a range of bytes, so large
>> >>>>strings would be done in batches, solving the latency problem.
>> >>>>
>> >>>>>
>> >>>>>Let's look at the basic hash algorithm.
>> >>>>[...]
>> >>>>>
>> >>>>>In looking at this I've added a primitive for hashMultiply; primitive
>> >>>>>#159 implements precisely self * 1664525 bitAnd: 16r0FFFFFFF for
>> >>>>>SmallInteger and LargePositiveInteger receivers, as fast as possible
>> >>>>>in the Cog JIT.  With this machinery in place it's instructive to
>> >>>>>compare the cost of the primitive against the non-primitive Smalltalk
>> >>>>>code.
>> >>>>>
>> >>>>>First let me introduce a set of replacement hash functions, newHashN.
>> >>>>>These hash all characters in strings up to a certain size, and then no
>> >>>>>more than that number for larger strings.  Here are newHash64 and
>> >>>>>newHash2048, which use pure Smalltalk, including an inlined
>> >>>>>hashMultiply written to avoid SmallInteger overflow.  Also measured
>> >>>>>are the obvious variants newHash128, newHash256, newHash512 &
>> >>>>>newHash1024.
>> >>>>>
>> >>>>>
>> >>>>[...]
>> >>>>>So the idea here is to step through the string by 1 for string sizes
>> >>>>>up to N - 1, and by greater than 1 for strings of size >= N, limiting
>> >>>>>the maximum number of characters sampled to between N // 2 and N - 1.
>> >>>>
>> >>>>The history of computing is littered with the bones of those who have
>> >>>>tried this kind of thing. It doesn't end well. Yes, you get a faster
>> >>>>hash function. And then you find, for sets of data that you or your
>> >>>>users actually want to use, that you get collisions like crazy, and
>> >>>>much
>> >>>>worse overall performance than you started with.
>> >>>>
>> >>>>Sure, it works OK for the sets of data that the designer *tested*, and
>> >>>>probably for the sets of data that they *anticipated*. But real-world
>> >>>>data is tricky. It includes data sets where the characters that differ
>> >>>>are the ones that the hash thinks are unimportant, and there goes your
>> >>>>performance, by orders of magnitude. For instance, early versions of
>> >>>>Java used a limited number of characters to hash strings. One of the
>> >>>>biggest compatibility-breaking changes they were forced to make in
>> >>>>later
>> >>>>Java versions was to consider *all* characters in hashing. It turned
>> >>>>out
>> >>>>that it was very common to hash URLs, and many distinct URLs had most
>> >>>>of
>> >>>>their characters in common.
>> >>>>
>> >>>>And you don't always get to choose your data -- sometimes you have an
>> >>>>opponent who is actively looking to create collisions as a
>> >>>>denial-of-service attack. There was a fair-sized kerfuffle about this
>> >>>>a
>> >>>>few years ago -- most web scripting languages made it too easy to mount
>> >>>>this kind of attack.
>> >>>>
>> >>>https://arstechnica.com/business/2011/12/huge-portions-of-web-vulnerable-to-hashing-denial-of-service-attack/
>> >>>>"...an attacker can degenerate the hash table by sending lots of
>> >>>>colliding keys. ...making it possible to exhaust hours of CPU time
>> >>>>using
>> >>>>a single HTTP request."
>> >>>>To guard against this kind of attack you need a randomized element in
>> >>>>your hash (not a bad idea for Smalltalk, actually, and pretty easy --
>> >>>>mixing in the identity hash of the collection might be sufficient) or a
>> >>>>cryptographic hash (not worth the computational expense for most
>> >>>>purposes). However, even adding a randomized element would not prevent
>> >>>>this kind of attack if you predictably completely ignore some
>> >>>>characters
>> >>>>of the input string. That just makes it *so* easy to generate data that
>> >>>>validates, and is not equal, but causes collisions.
>> >>>>
>> >>>>So really, for general-purpose use (i.e. what's built into the
>> >>>>language)
>> >>>>hash *every* character of *all* strings. If someone finds that this is
>> >>>>a
>> >>>>performance problem in a real-world situation, it can be addressed in
>> >>>>an
>> >>>>application-specific way.
>> >>>
>> >>>
>> >>
>>


Re: [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Eliot Miranda-2
In reply to this post by David T. Lewis


On Mon, May 1, 2017 at 5:11 PM, David T. Lewis <[hidden email]> wrote:
On Mon, May 01, 2017 at 07:26:35PM +0200, Levente Uzonyi wrote:
> I presume that a general purpose in-image solution would be more complex.
> String already has too many subclasses (6 in Squeak), while at the same
> time other kinds of new subclasses would be welcome too, e.g. Strings with
> 2-byte characters.
> Since these properties are orthogonal, there would be many new subclasses
> to cover all cases.
> Storing the size of the string in a different header word is a VM specific
> detail, so I think caching the hash could also be hidden from the image.
>
> Levente

Actually, I meant something more like this:

   Object subclass: #LargeString
        instanceVariableNames: 'anyKindOfString cachedHashValueForTheString'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Probably a Bad Idea'


I was guessing that hashing very large strings would imply a somewhat
specialized problem domain, for which a wrapper class might make sense.
Certainly it would not be a general solution.

But this implies changing very many places in the VM that access a string as a vector of bytes.  That's why Levente suggests hiding the hash in unused memory in Spur.  It doesn't have the side effect of requiring a major rewrite.
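
For a sense of the arithmetic behind that suggestion, here is the split Levente describes
(28+1 bits of hash plus 35 bits of length in one 64-bit field) worked through in plain
Smalltalk; the bit positions are an illustrative assumption, not actual Spur header code:

| length hash packed |
length := 123456789. "any length below 2 raisedTo: 35"
hash := 16r0FFFFFFF. "a 28-bit hash value"
packed := (1 bitShift: 63) "bit 63: hash-is-valid flag, the '+1'"
	bitOr: ((hash bitShift: 35) bitOr: length). "bits 35-62: hash; bits 0-34: length"
(packed bitShift: -35) bitAnd: 16r0FFFFFFF. "recovers the hash"
packed bitAnd: (1 bitShift: 35) - 1. "recovers the length"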

 

I am probably over my quota of bad ideas for today, so I'll stop now ;-)

Dave






--
_,,,^..^,,,_
best, Eliot



Re: [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Eliot Miranda-2
In reply to this post by Levente Uzonyi


On Mon, May 1, 2017 at 5:42 PM, Levente Uzonyi <[hidden email]> wrote:
On Mon, 1 May 2017, David T. Lewis wrote:

On Mon, May 01, 2017 at 07:26:35PM +0200, Levente Uzonyi wrote:
I presume that a general purpose in-image solution would be more complex. String already has too many subclasses (6 in Squeak), while at the same time other kinds of new subclasses would be welcome too, e.g. Strings with 2-byte characters.
Since these properties are orthogonal, there would be many new subclasses to cover all cases.
Storing the size of the string in a different header word is a VM specific detail, so I think caching the hash could also be hidden from the image.

Levente

Actually, I meant something more like this:

  Object subclass: #LargeString
        instanceVariableNames: 'anyKindOfString cachedHashValueForTheString'
        classVariableNames: ''
        poolDictionaries: ''
        category: 'Probably a Bad Idea'


I was guessing that hashing very large strings would imply a somewhat
specialized problem domain, for which a wrapper class might make sense.
Certainly it would not be a general solution.

I am probably over my quota of bad ideas for today, so I'll stop now ;-)

That's one way to solve the problem. It works, but you have to wrap all your strings, or they won't be equal to non-wrapped strings. So, it's not a transparent solution.
If you just want to use a different, presumably quicker hash function, you can always use PluggableSet and PluggableDictionary.
(If we had a really good JIT, using them would have no overhead compared to Set and Dictionary.)

Right.  We have a crap JIT.  Even a "good JIT" wouldn't fix all the places in the VM source where a string is accessed.




Levente



Dave








--
_,,,^..^,,,_
best, Eliot



Re: [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Martin McClure-2
In reply to this post by Eliot Miranda-2
On 05/01/2017 11:55 AM, Eliot Miranda wrote:

>
>
> On Mon, May 1, 2017 at 10:37 AM, Tony Garnock-Jones <[hidden email]> wrote:
>
>     On 5/1/17 1:26 PM, Levente Uzonyi wrote:
>     > I presume that a general purpose in-image solution would be more
>     > complex. String already has too many subclasses (6 in Squeak), while at
>     > the same time other kinds of new subclasses would be welcome too, e.g.
>     > Strings with 2-byte characters.
>     > Since these properties are orthogonal, there would be many new
>     > subclasses to cover all cases.
>
>     A classic motivating case for Traits, right?
>
>
> Alas no.  The problem with String's subclasses is that they're binary
> objects.  They have no inst vars in which one can cache a hash.  They're
> just a flat vector of bytes.  If one added a trait to them the system
> would fail because there's no way to add an inst var to a binary object
> in the current Smalltalk object representation.  Hence Levente's very
> clever idea of hiding the hash in a hidden header word.
>
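
The binary-object constraint is easy to confirm in a workspace (assuming a standard
Squeak or Pharo image; each line is a separate expression to print):

ByteString instSize. "0 - no named instance variables in which to cache a hash"
ByteString isBytes. "true - instances are flat vectors of bytes"
ByteString isVariable. "true - only indexed fields"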

That is a clever idea. I wouldn't actually *implement* that solution
though. It would add considerable complexity, considering that strings
are mutable and their hashes can change. And the problem that this would
solve is hypothetical, isn't it? At least I haven't seen anyone pointing
to any actual applications that need to hash very large strings.
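
A tiny example of that complication (nothing below is specific to the proposed header
cache; it just shows why at:put: would have to invalidate any stored hash):

| s before |
s := String new: 300 withAll: $a. "long enough to get the extra 64-bit length field in Spur"
before := s hash.
s at: 1 put: $b.
s hash = before. "almost certainly false - a hash cached before the at:put: would be stale"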

Regards,

-Martin



Re: [Vm-dev] [Pharo-dev] Byte & String collection hash performance; a modest proposal for change.

Levente Uzonyi
In reply to this post by Eliot Miranda-2
On Mon, 1 May 2017, Eliot Miranda wrote:

>
>
> On Mon, May 1, 2017 at 5:11 PM, David T. Lewis <[hidden email]> wrote:
>       On Mon, May 01, 2017 at 07:26:35PM +0200, Levente Uzonyi wrote:
>       > I presume that a general purpose in-image solution would be more complex.
>       > String already has too many subclasses (6 in Squeak), while at the same
>       > time other kinds of new subclasses would be welcome too, e.g. Strings with
>       > 2-byte characters.
>       > Since these properties are orthogonal, there would be many new subclasses
>       > to cover all cases.
>       > Storing the size of the string in a different header word is a VM specific
>       > detail, so I think caching the hash could also be hidden from the image.
>       >
>       > Levente
>
>       Actually, I meant something more like this:
>
>          Object subclass: #LargeString
>               instanceVariableNames: 'anyKindOfString cachedHashValueForTheString'
>               classVariableNames: ''
>               poolDictionaries: ''
>               category: 'Probably a Bad Idea'
>
>
>       I was guessing that hashing very large strings would imply a somewhat
>       specialized problem domain, for which a wrapper class might make sense.
>       Certainly it would not be a general solution.
>
>
> But this implies changing very many places in the VM that access a string as a vector of bytes.  That's why Levente suggests hiding the hash in unused memory in Spur.  It doesn't have the side effect of requiring a major rewrite.
I guess Dave didn't mean for the VM to treat LargeString like other Strings.

Levente

>
>  
>
>       I am probably over my quota of bad ideas for today, so I'll stop now ;-)
>
>       Dave
>
>
>
>
>
>
> --
> _,,,^..^,,,_
> best, Eliot
>
>