Chris Cunningham uploaded a new version of Collections to project The Inbox:
http://source.squeak.org/inbox/Collections-cbc.812.mcz

==================== Summary ====================

Name: Collections-cbc.812
Author: cbc
Time: 15 November 2018, 10:06:05.244237 pm
UUID: 1977358c-ec82-5442-8f1d-027d4bc817a3
Ancestors: Collections-eem.811

Make intervals that are #= have the same hash.
Also, slight adjustment to #=.

=============== Diff against Collections-eem.811 ===============

Item was changed:
  ----- Method: Interval>>= (in category 'comparing') -----
  = anObject
  	^ self == anObject
+ 		or: [anObject isInterval
+ 			ifFalse: [super = anObject]
+ 			ifTrue:
+ 				[start = anObject first
+ 				and: [step = anObject increment
+ 				and: [self last = anObject last]]]]!
- 		ifTrue: [true]
- 		ifFalse: [anObject isInterval
- 			ifTrue: [start = anObject first
- 				and: [step = anObject increment
- 				and: [self last = anObject last]]]
- 			ifFalse: [super = anObject]]!

Item was changed:
  ----- Method: Interval>>hash (in category 'comparing') -----
  hash
  	"Hash is reimplemented because = is implemented."

  	^(((start hash bitShift: 2)
+ 		bitOr: self last hash)
- 		bitOr: stop hash)
  		bitShift: 1)
  		bitOr: self size!
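To see why the change was needed: Interval>>#= compares start, increment, and last, while the old #hash mixed in stop, so two intervals could be equal yet hash differently. A minimal workspace sketch of the bug, assuming the pre-812 code above:

	| a b |
	a := 1 to: 10 by: 2.	"elements 1 3 5 7 9; stop = 10, last = 9"
	b := 1 to: 9 by: 2.	"elements 1 3 5 7 9; stop = 9,  last = 9"
	a = b.			"true: same start, increment, and last"
	"But the old hash used stop hash rather than self last hash, so
	 a hash = b hash could answer false, breaking the #= / #hash
	 contract and making such intervals misbehave in Sets and
	 Dictionaries."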
On Fri, 16 Nov 2018, [hidden email] wrote:
> Chris Cunningham uploaded a new version of Collections to project The Inbox:
> http://source.squeak.org/inbox/Collections-cbc.812.mcz
>
> Item was changed:
>  ----- Method: Interval>>hash (in category 'comparing') -----
>  hash
>  	"Hash is reimplemented because = is implemented."
>
>  	^(((start hash bitShift: 2)
> +		bitOr: self last hash)
> -		bitOr: stop hash)
>  		bitShift: 1)
>  		bitOr: self size!

As Bert pointed out, #bitOr: is not a good choice for hashing, because it loses entropy. Also, I'd use #hashMultiply instead of ad-hoc bit shifts to maximize the entropy of the hash value. E.g.:

hash

	^((start hash hashMultiply bitXor: self last hash) hashMultiply bitXor: self size) hashMultiply

Below is a snippet comparing various hash functions:

hashes := Set new. "Original #hash"
hashes2 := Set new. "#hash from Collections-cbc.812"
hashes3 := Set new. "The above proposed #hash using #hashMultiply and #bitXor:"
intervals := Set new. "The maximum number of different #hash values"
Interval allInstancesDo: [ :each |
	hashes add: each hash.
	hashes2 add: each hash2.
	hashes3 add: each hash3.
	intervals add: each ].
{ hashes size. hashes2 size. hashes3 size. intervals size }
"==> #(584 584 1113 1113)"

The sample is too small, but it suggests that the original #hash and the #hash in Collections-cbc.812 are suboptimal. If we create a few more intervals in a workspace:

newIntervals := (1 to: 10000) collect: [ :each | (1 to: each) ].
newIntervals2 := (1 to: 10000) collect: [ :each | (1 to: each by: 2) ].
newIntervals3 := (1 to: 10000) collect: [ :each | (each to: 1 by: -1) ].

then the numbers will be more accurate:

#(6122 5368 26244 27043)

Levente
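Note that the snippet sends #hash2 and #hash3, which are not standard Interval methods; presumably Levente compiled helpers like the following into Interval's 'comparing' category. A sketch, with selector names taken from the snippet and bodies from the code quoted above:

hash2
	"Sketch: the #hash from Collections-cbc.812, under a separate selector for comparison."
	^(((start hash bitShift: 2)
		bitOr: self last hash)
		bitShift: 1)
		bitOr: self size

hash3
	"Sketch: the proposed #hashMultiply / #bitXor: hash, under a separate selector."
	^((start hash hashMultiply bitXor: self last hash) hashMultiply
		bitXor: self size) hashMultiply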
> then the numbers will be more accurate:
>
> #(6122 5368 26244 27043)

That's a pretty dramatic improvement in the hash dispersion. On one hand, we were just trying to fix the discrepancy with #=, not actually improve the #hash. But, since we're in here anyway....

It would be a disruption if someone has Intervals in a HashedCollection, but probably a minor one, since they can rehash, after which they should enjoy better performance. I do keep some large MagmaDictionarys which rely on the standard #hash, but don't allow enumeration (due to their size), and so can't really be rehashed except by rebuilding them. But, if I have any Intervals in them, I can probably deal with it. So my guess is this is probably a worthwhile improvement.

I'll go along with whatever y'all decide, but if it's Levente's, please don't forget to reparent to the trunk version. :)

Much appreciated!

Best,
  Chris
Chris,

If/when we implement this, the idea is to rehash the collections using:

	HashedCollection rehashAll

Will this adversely affect your MagmaDictionarys? Or will those be handled some other way?

-cbc

On Fri, Nov 16, 2018 at 2:02 PM Chris Muller <[hidden email]> wrote:
> then the numbers will be more accurate:
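For readers unfamiliar with why rehashing is required at all, a minimal workspace sketch (not from the thread): once Interval>>#hash changes, elements added under the old hash sit in the wrong buckets until the collection is rehashed.

	| s |
	s := Set with: (1 to: 9 by: 2).
	"Suppose Interval>>#hash is then redefined, e.g. by loading the fix.
	 Lookup probes the bucket for the NEW hash, but the element was
	 stored under the OLD one:"
	s includes: (1 to: 9 by: 2).	"may now answer false"
	s rehash.			"re-buckets every element under the new hash"
	s includes: (1 to: 9 by: 2).	"true again; HashedCollection rehashAll does this image-wide"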
> Chris,
> If/when we implement this, the idea is to rehash the collections using:
>	HashedCollection rehashAll
> Will this adversely affect your MagmaDictionarys?

No, but it won't help them at all, either. Does anyone know if ReferenceStream automatically rehashes Dictionarys and Sets when it materializes them? If not, then the impact of changing the hash calculation is much higher than if it does. Magma does, so no action should be needed for regular Sets and Dictionarys.

> Or will those be handled some other way?

However, Magma has another special dictionary called a MagmaDictionary which is designed to be larger than RAM. It maintains a reference to its 'session', which communicates with the server to access the large dictionary contents straight from the disk. If I have Intervals in any of these, I'll have to plan to rebuild them from scratch manually with utility script(s), because they don't support rehashing the way small, in-memory HashedCollections do.

- Chris
Hi Chris,
On Fri, Nov 16, 2018 at 3:11 PM Chris Muller <[hidden email]> wrote:
> Does anyone know if ReferenceStream automatically rehashes Dictionarys and Sets when it materializes them?

It has to, because:

Since Object>>#= and Object>>#hash fall back on #== and #identityHash, Dictionary, Set et al instances are potentially affected by identity. Therefore, on reconstructing a Dictionary, Set et al instance, an unpickler must rehash (*).

(*) unless it verifies that all elements implement their own #= and #hash, which is intractable in practice. The only ways I can see of verifying that an object does not use #== or #identityHash in its #= and #hash methods are
a) to analyze the code (impossible for a non-AI unpickler), or
b) to construct a shallow copy of the object (since lots of #= implementations short-cut via "^self == other or: [...") and simulate #= and #hash, to see if #== or #identityHash is sent.

Either of these would slow down unpickling enormously; rehashing invokes #= and #hash anyway, but at full execution speed.
_,,,^..^,,,_
best, Eliot
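To make the point above concrete, a small sketch (an added illustration, not part of Eliot's message): an object with the default identity-based #hash gets a fresh #identityHash when it is reconstructed, so a Set built around the original's buckets cannot be trusted without a rehash.

	| original reconstructed |
	original := Object new.			"default #= / #hash use identity"
	reconstructed := original shallowCopy.	"stands in for the unpickled object"
	original identityHash = reconstructed identityHash.
		"almost always false: identityHash is assigned per object,
		 so the reconstruction lands in a different bucket"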