Hi all,
I am investigating performance overhead in benchmarks to improve VM performance. In one benchmark, strings are used as dictionary keys, which seems to me like a reasonable pattern that real code may well use from time to time. Something struck me: the hash function of a string actually iterates over the whole string.
As a result, I get something like this:

#(
'String size: 0, Hash function performance: 17,817,776 per second'
'String size: 1, Hash function performance: 15,395,388 per second'
'String size: 3, Hash function performance: 14,853,109 per second'
'String size: 4, Hash function performance: 15,151,954 per second'
'String size: 5, Hash function performance: 14,139,881 per second'
'String size: 10, Hash function performance: 11,545,749 per second'
'String size: 65, Hash function performance: 3,431,067 per second'
'String size: 130, Hash function performance: 1,879,047 per second'
'String size: 520, Hash function performance: 511,934 per second'
)

This means that if the hash of a 1MB String is computed, it takes a hell of a time.

I wonder why we don't compute the string hash from a small number of inner characters taken randomly but deterministically from the String. For example, for all strings longer than 5 characters, one could pick only 5 characters in the String to compute its hash. This way the hash function is still decent enough and its performance does not drop for large strings.

One implementation could be to replace:
	1 to: stringSize do: [:pos |
by:
	1 to: stringSize by: stringSize // 5 do: [:pos |
in:
	ByteString>>stringHash: aString initialHash: speciesHash

Then we need to rehash the whole image.

What do you think? Is there a hash expert around who can confirm this is a good idea?
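For concreteness, the change sketched above might end up looking roughly like this. It is an untested sketch: the method name and the sampling step are illustrative, while the hashMultiply step is the one from the existing stringHash:initialHash: loop.

	"Untested sketch of the sampled variant. The hashMultiply step matches the
	 existing stringHash:initialHash: loop; the (stringSize // 5) max: 1 sampling
	 step is the proposal under discussion, not existing behaviour."
	sampledStringHash: aString initialHash: speciesHash
		| stringSize hash low step |
		stringSize := aString size.
		hash := speciesHash bitAnd: 16rFFFFFFF.
		step := (stringSize // 5) max: 1.	"sample only a handful of positions"
		1 to: stringSize by: step do: [:pos |
			hash := hash + (aString basicAt: pos).
			"begin hashMultiply"
			low := hash bitAnd: 16383.
			hash := (16r260D * low
				+ ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384))
				bitAnd: 16r0FFFFFFF].
		^hash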
> On 13-04-2017, at 2:19 PM, Clément Bera <[hidden email]> wrote:
>
> This means that if the hash of a 1MB String is computed, it takes a hell of a time.
>
> I wonder why we don't compute the string hash from a small number of inner characters
> taken randomly but deterministically from the String. For example, for all strings longer
> than 5 characters, one could pick only 5 characters in the String to compute its hash.

Yay! We haven’t had a good hash function argument in *ages*!

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Klingon Code Warrior:- 5) "Specs are for the weak and timid!"
In reply to this post by Clément Béra
Hi Clément,

On Thu, 13 Apr 2017, Clément Bera wrote:

> What do you think? Is there a hash expert around who can confirm this is a good idea?

I'm not a hash expert. But while I'm sure it's tempting to make this change, I'm also sure it won't be a good idea. Why? What makes a hash function good? A good hash function is expected to give different values for similar, but not equal, inputs. If you omit information, as you proposed, you can't satisfy this requirement.

Let's say we have strings of length 10, because those are the shortest strings affected by your proposal. Let the strings have the form 'ProductXXXX' where XXXX are numbers identifying a given product type. Let's see what would happen with the current implementation (step=1) and your suggested change (step=2) to the different hash values (the code has extra complexity to be cross-dialect):

	#(1 2) collect: [ :step |
		step -> ((0 to: 9999) collect: [ :index |
			| aString speciesHash stringSize hash low |
			aString := 'Product', ('0000' first: (4 - (index numberOfDigitsInBase: 10))), index asString.
			speciesHash := ByteString identityHash.
			stringSize := aString size.
			hash := speciesHash bitAnd: 16rFFFFFFF.
			1 to: stringSize by: step do: [:pos |
				hash := hash + (aString basicAt: pos).
				"Begin hashMultiply"
				low := hash bitAnd: 16383.
				hash := (16r260D * low
					+ ((16r260D * (hash bitShift: -14) + (16r0065 * low) bitAnd: 16383) * 16384))
					bitAnd: 16r0FFFFFFF ].
			hash ] as: Set) size ]

The result is: {1->10000. 2->100}. So the number of different hash values went down from the optimal 10000 to 100 - aka 1%. What's even worse is that the problem exists if you take any small consecutive range; only the ratio changes.

So, IMO the implementation should stay as is - the hash should include information about all bytes of the string.

The good news is that there is PluggableDictionary, which allows you to implement your suggestion on the image side. Actually, if we had a really good JIT - one where the overhead of using blocks were negligible - then we could just make Dictionary itself pluggable and collapse large parts of the HashedCollection hierarchy.

Levente
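For reference, the image-side PluggableDictionary route could be set up roughly like this. This is only a sketch, assuming PluggableDictionary's hashBlock:/equalBlock: setters; the sampling hash block itself is purely illustrative, not a recommendation.

	"Sketch: an application-level dictionary whose hash block samples only a
	 handful of characters per key."
	| dict |
	dict := PluggableDictionary new
		hashBlock: [:key | | step hash |
			step := (key size // 5) max: 1.
			hash := key size.
			1 to: key size by: step do: [:pos |
				hash := (hash + (key at: pos) asInteger) hashMultiply].
			hash];
		equalBlock: [:a :b | a = b];
		yourself.
	dict at: 'Product0001' put: 42.
	dict at: 'Product0001'.	"42"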
In reply to this post by Clément Béra
That would radically increase the number of collisions from the hash function. I don’t think that is a good idea. *Especially* when we are talking about a hash function that will be used for dictionaries. When you test the performance of a hash function you should also test its properties w.r.t. hashed collections (Andrés Valloud explains that nicely in “Hashing in Smalltalk”).
Cheers, Max
However, since Symbols are unique, we could have a different hashing policy, like for example pinning Symbols and then using the identity hash. It would probably complicate the handling of memory, but that's the deal.

2017-04-14 6:05 GMT+02:00 Max Leske <[hidden email]>:
2017-04-14 8:03 GMT+02:00 Nicolas Cellier <[hidden email]>:
Oops, forget it: the identity hash does not rely on object position in memory, so no need for pinning.
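For what it's worth, the idea boils down to something like the following one-liner. It is a sketch only; as a caveat (an assumption worth checking), identity hashes span far fewer bits than string hashes on current VMs, so distribution in large hashed collections may suffer.

	"Sketch: O(1) Symbol hashing via the identity hash, relying on Symbols
	 being interned so that equal Symbols are identical. Caveat (assumption):
	 the identity hash has a limited bit width, so collisions may increase."
	Symbol >> hash
		^self identityHash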
In reply to this post by Clément Béra
I had an application where I needed to store arrays into sets, and performance depended heavily on this operation. Standard array classes were unacceptable because hashing was either too poor (then) or too expensive (now). So I created my own array class (say HashArray; I forget the name now). HashArray stored a hash value rather than compute one. The hash value of an array was the XOR of its elements (which in my application have good hash values). This worked very well but came at a price: each time an element was added, removed or changed for a HashArray instance, the hash value had to be recomputed (O(1) cost). For my application this wasn't a problem, so my solution worked well.

Notes:

1) In my case two arrays containing the same elements should have the same hash value. Your situation might be different. If so, you could also add to the hash value some value associated with the array itself.

2) While contained in a set the array couldn't be changed (didn't happen in my application), but this is true for most objects anyway since their hash value cannot change; just something to be aware of.

3) In my application the elements of the array also stored their hash values. The hash values were generated using a random number generator. This could be done in my application but obviously can't always be done. This gave very good hashing for my arrays.

If you are having performance problems with array classes because of hashing, perhaps you should try a scheme similar to mine. Whether or not such a class should be part of Squeak, I don't know. Perhaps there should be a package of HashArray-like classes that could be loaded.

Ralph Boland
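A minimal sketch of that kind of class, written here as a wrapper around an Array rather than as an array class of its own to keep it short. All names are illustrative assumptions, not the original code.

	"Sketch: cache the XOR of the elements' hashes and update it in O(1) per write."
	Object subclass: #HashArray
		instanceVariableNames: 'elements cachedHash'
		classVariableNames: ''
		poolDictionaries: ''
		category: 'HashArray-Sketch'

	HashArray class >> new: size
		^self basicNew setSize: size

	HashArray >> setSize: size
		elements := Array new: size.
		cachedHash := elements inject: 0 into: [:acc :each | acc bitXor: each hash]

	HashArray >> at: index
		^elements at: index

	HashArray >> at: index put: anObject
		"XOR out the old element's hash, XOR in the new one: O(1) per write."
		cachedHash := (cachedHash bitXor: (elements at: index) hash) bitXor: anObject hash.
		^elements at: index put: anObject

	HashArray >> elements
		^elements

	HashArray >> hash
		^cachedHash

	HashArray >> = other
		^self class == other class and: [elements = other elements]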
In reply to this post by Levente Uzonyi
> So the number of different hash values went down from the optimal 10000 to 100 - aka 1%.
> What's even worse is that the problem exists if you take any small consecutive range; only
> the ratio changes.

The speed tradeoff seems most acute for large chunks of text. If the original strategy were used for strings of size less than (say) 50 and the "sampling" strategy were used for longer strings, with the string length included in the hash, then a large chunk of text with one character added would likely not hash collide.

An app working with strings at large scale could tune the hash function, as has been suggested, if performance were poor; the tuning would be done by someone likely to be oriented to the problem, while performance would stay good in the general case.

$0.02
-KenD
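A rough sketch of that hybrid, with the threshold and the amount of sampling as illustrative assumptions (untested):

	"Sketch: full hash below a threshold, sampled hash plus the length above it.
	 The threshold of 50 and the ~5 sampled positions are illustrative choices."
	hybridStringHash: aString initialHash: speciesHash
		| size hash step |
		size := aString size.
		hash := speciesHash bitAnd: 16rFFFFFFF.
		step := size < 50 ifTrue: [1] ifFalse: [size // 5].
		1 to: size by: step do: [:pos |
			hash := (hash + (aString basicAt: pos)) hashMultiply].
		"Mixing in the length means a large string with one character appended
		 still gets a different hash even if no sampled position changes."
		^(hash + size) hashMultiply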
In reply to this post by Clément Béra
On 04/13/2017 02:19 PM, Clément Bera wrote:
> I wonder why we don't compute the string hash from a small number of
> inner characters taken randomly but deterministically from the String.
> For example, for all strings longer than 5 characters, one could pick
> only 5 characters in the String to compute its hash. This way the hash
> function is still decent enough and its performance does not drop for
> large strings.
>
> One implementation could be to replace:
> 1 to: stringSize do: [:pos |
> by:
> 1 to: stringSize by: stringSize // 5 do: [:pos |
> in:
> ByteString>>stringHash: aString initialHash: speciesHash
>
> Then we need to rehash the whole image.
>
> What do you think? Is there a hash expert around who can confirm this
> is a good idea?

This kind of thing has been tried before, and failed miserably. One of the biggest breaking incompatibilities, IIRC, between Java 1 and Java 2 was that they had to change their string hash function to hash all characters, because their scheme of hashing only a subset of characters had a huge number of collisions in real-world strings like URLs, which have a lot of characters in common.

I think it's worth the time to hash all characters.

Regards,

-Martin
In reply to this post by Clément Béra
IME, being smart about which characters to hash quickly runs into data sets being smarter about hiding variation in the unsampled characters. Slower hash functions aren't necessarily bad.

Practical experience:

1. Investment bank end-of-day report takes 30 minutes to run. The time is spent mostly in linear search because of hash collisions. Introduced a 5x slower hash function that had a collision rate of about 2.5 instead of 20+. The report now takes 90 seconds, and hashing is nowhere to be seen in the profiler output. No need to optimize hashing further.

2. VisualWorks string / symbol hashes were also "smart". Replaced with a multiplicative hash along the lines of the Horner rule, which is slower for large enough strings. But interning all symbols into a new symbol table now takes 7x less time, and one of the k-nucleotide-related Computer Language Shootout benchmarks now runs in half the time.

Regarding huge strings, surely a 1mb string has a special meaning that could be encapsulated within an object, and it is *this* object that could provide a hash function suitable for the purpose at hand. In fact, especially in those cases, the returned hash value might not even need to be based on the 1mb string at all.

On 4/13/17 14:19 , Clément Bera wrote:
> I wonder why we don't compute the string hash from a small number of
> inner characters taken randomly but deterministically from the String.
> [...]
>
> What do you think? Is there a hash expert around who can confirm this
> is a good idea?
On Sat, Apr 15, 2017 at 3:34 AM, Andres Valloud <[hidden email]> wrote:
This seems like the best practical advice.

What are opinions on having a HashedString subclassed from String, to optimise by calculating the hash only once?

a. In the standard Image, or leave it specific to applications?

b. Obviously an ivar to store the hash computed at instantiation.

c. Maybe have an ivar storing a custom hash-block? Equality would check this is identical between objects.

cheers -ben
If it were up to me, I'd leave that decision to the applications. No need to optimize unknown future problems.

Also keep in mind applications likely should refine strings by composition and delegation rather than by inheritance in this particular case. Otherwise,

	String
		Base64EncodedMP3File
		PostgresqlDataBlob
		SmalltalkMethodSourceCode
		...

and it stops making sense quickly.

On 4/14/17 16:12 , Ben Coman wrote:
> What are opinions on having a HashedString subclassed from String to
> optimise calculating hash only once?
>
> a. In the standard Image or leave it specific to applications?
>
> b. Obviously an ivar to store the hash computed at instantiation.
>
> c. Maybe have an ivar storing a custom hash-block? Equality would
> check this is identical between objects.
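By way of illustration, the composition-and-delegation version of the cached-hash idea could look like this sketch (class and method names are assumptions, not an existing API):

	"Sketch: cache the hash in a wrapper instead of subclassing String.
	 The wrapped string must not be mutated while the wrapper is used as a
	 key in a hashed collection."
	Object subclass: #CachedHashString
		instanceVariableNames: 'string hash'
		classVariableNames: ''
		poolDictionaries: ''
		category: 'CachedHash-Sketch'

	CachedHashString class >> on: aString
		^self basicNew setString: aString

	CachedHashString >> setString: aString
		string := aString.
		hash := aString hash

	CachedHashString >> string
		^string

	CachedHashString >> hash
		^hash

	CachedHashString >> = other
		^self class == other class and: [string = other string]

Used as a dictionary key, #hash is then O(1) after construction, while #= still compares the full contents.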
> On 14-04-2017, at 4:38 PM, Andres Valloud <[hidden email]> wrote:
>
> If it were up to me, I'd leave that decision to the applications. No need to optimize unknown future problems.

Exactly; remember the rules for optimising -
1) Don’t.
2) (for experts only) Don’t, *yet*.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful Latin Phrases:- O! Plus! Perge! Aio! Hui! Hem! = Oh! More! Go on! Yes! Ooh! Ummm!