The Inbox: Collections-cbc.812.mcz


The Inbox: Collections-cbc.812.mcz

commits-2
Chris Cunningham uploaded a new version of Collections to project The Inbox:
http://source.squeak.org/inbox/Collections-cbc.812.mcz

==================== Summary ====================

Name: Collections-cbc.812
Author: cbc
Time: 15 November 2018, 10:06:05.244237 pm
UUID: 1977358c-ec82-5442-8f1d-027d4bc817a3
Ancestors: Collections-eem.811

Make intervals that are #= have the same hash.
Also, slight adjustment to #=.

=============== Diff against Collections-eem.811 ===============

Item was changed:
  ----- Method: Interval>>= (in category 'comparing') -----
  = anObject
-
  ^ self == anObject
+ or: [anObject isInterval
+ ifFalse: [super = anObject]
+ ifTrue:
+ [start = anObject first
+ and: [step = anObject increment
+ and: [self last = anObject last]]]]!
- ifTrue: [true]
- ifFalse: [anObject isInterval
- ifTrue: [start = anObject first
- and: [step = anObject increment
- and: [self last = anObject last]]]
- ifFalse: [super = anObject]]!

Item was changed:
  ----- Method: Interval>>hash (in category 'comparing') -----
  hash
  "Hash is reimplemented because = is implemented."
 
  ^(((start hash bitShift: 2)
+ bitOr: self last hash)
- bitOr: stop hash)
  bitShift: 1)
  bitOr: self size!
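A workspace sketch (not part of the commit) of the contract violation being fixed here: two intervals can be #= while their stop instance variables differ, so the old #hash, which mixed in stop hash, could answer different hashes for equal intervals.

| a b |
a := 1 to: 6 by: 2.  "elements 1 3 5; stop = 6, last = 5"
b := 1 to: 5 by: 2.  "elements 1 3 5; stop = 5, last = 5"
a = b.  "true: same start, step and last element"
"Hashing self last instead of stop restores a = b ==> a hash = b hash."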



Re: The Inbox: Collections-cbc.812.mcz

Levente Uzonyi
On Fri, 16 Nov 2018, [hidden email] wrote:

> Chris Cunningham uploaded a new version of Collections to project The Inbox:
> http://source.squeak.org/inbox/Collections-cbc.812.mcz
>
> Item was changed:
>  ----- Method: Interval>>hash (in category 'comparing') -----
>  hash
>   "Hash is reimplemented because = is implemented."
>
>   ^(((start hash bitShift: 2)
> + bitOr: self last hash)
> - bitOr: stop hash)
>   bitShift: 1)
>   bitOr: self size!

As Bert pointed out, #bitOr: is not a good choice for hashing, because
it loses entropy. Also, I'd use #hashMultiply instead of ad-hoc bit
shifts to maximize the entropy of the hash value. E.g.:

hash

  ^((start hash hashMultiply bitXor: self last hash) hashMultiply
  bitXor: self size) hashMultiply
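The entropy loss #bitOr: causes can be seen in a workspace: OR is biased toward 1-bits, so distinct input pairs pile up on a few outputs, while #bitXor: spreads them uniformly. An illustrative check over all pairs of 2-bit values:

(0 to: 3) collect: [:a | (0 to: 3) collect: [:b | a bitOr: b]].
"the 16 pairs yield 2r11 nine times, but 2r00 only once"

(0 to: 3) collect: [:a | (0 to: 3) collect: [:b | a bitXor: b]].
"each of the four possible results appears exactly four times"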

Below is a snippet comparing various hash functions:

hashes := Set new. "Original #hash"
hashes2 := Set new. "#hash from Collections-cbc.812"
hashes3 := Set new. "The above proposed #hash using #hashMultiply and #bitXor:"
intervals := Set new. "The maximum number of different #hash values"

Interval allInstancesDo: [ :each |
  hashes add: each hash.
  hashes2 add: each hash2.
  hashes3 add: each hash3.
  intervals add: each ].
{
  hashes size.
  hashes2 size.
  hashes3 size.
  intervals size }

"==> #(584 584 1113 1113)"

The sample is too small, but it suggests that the original #hash and the
#hash in Collections-cbc.812 are suboptimal.
If we create a few more intervals in a workspace:

newIntervals := (1 to: 10000) collect: [ :each | (1 to: each) ].
newIntervals2 := (1 to: 10000) collect: [ :each | (1 to: each by: 2) ].
newIntervals3 := (1 to: 10000) collect: [ :each | (each to: 1 by: -1) ].

then the numbers will be more accurate:

#(6122 5368 26244 27043)

Levente


Re: The Inbox: Collections-cbc.812.mcz

Chris Muller-3
> then the numbers will be more accurate:
>
> #(6122 5368 26244 27043)

That's a pretty dramatic improvement in the hash dispersion.

On one hand, we were just trying to fix the discrepancy with #=, not
actually improve the #hash.  But, since we're in here anyway....

It would be a disruption if someone has them in a HashedCollection,
but probably minor since they can rehash, after which they should
enjoy better performance.

I do keep some large MagmaDictionary's which rely on the standard
#hash but don't allow enumeration (due to their size), and so can't
really be rehashed except by rebuilding them.  But, if I have any
Intervals in them, I can probably deal with it.

So my guess is this is probably a worthwhile improvement.  I'll go
along with whatever y'all decide, but if it's Levente's, please don't
forget to reparent to the trunk version.  :)  Much appreciated!

Best,
  Chris


Re: The Inbox: Collections-cbc.812.mcz

cbc
Chris,
If/when we implement this, the idea is to rehash the collections using:
    HashedCollection rehashAll
Will this adversely affect your MagmaDictionary's?  Or will those be handled some other way?

-cbc

On Fri, Nov 16, 2018 at 2:02 PM Chris Muller <[hidden email]> wrote:
> then the numbers will be more accurate:
>
> #(6122 5368 26244 27043)



Re: The Inbox: Collections-cbc.812.mcz

Chris Muller-4
> Chris,
> If/when we implement this, the idea is to rehash the collection using:
>     HashedCollection rehashAll
> Will this adversely affect your MagmaDictionary's?

No, but it won't help them at all, either.

Does anyone know if ReferenceStream automatically rehashes Dictionary's
and Sets when it materializes them?  If not, then the impact of
changing the hash calculation is much higher than if it does.

Magma does, so no action should be needed for regular Sets and Dictionary's.

> Or will those be handled some other way?

However, Magma has another special dictionary called a MagmaDictionary
which is designed to be larger than RAM.  It maintains a reference to
its 'session', which communicates with the server to access the large
dictionary contents straight from the disk.  If I have Intervals in
any of these, I'll have to rebuild them from scratch with utility
scripts, because they don't support rehashing the way small, in-memory
HashedCollections do.

 - Chris


Re: The Inbox: Collections-cbc.812.mcz

Eliot Miranda-2
Hi Chris,

On Fri, Nov 16, 2018 at 3:11 PM Chris Muller <[hidden email]> wrote:
> Chris,
> If/when we implement this, the idea is to rehash the collection using:
>     HashedCollection rehashAll
> Will this adversely affect your MagmaDictionary's?

> No, but it won't help them at all, either.
>
> Does anyone know if ReferenceStream automatically rehashes Dictionary's
> and Sets when it materializes them?  If not, then the impact of
> changing the hash calculation is much higher than if it does.

It has to, because Object>>#= and Object>>#hash fall back on #== and #identityHash, so Dictionary, Set et al instances are potentially affected by identity.  Therefore, on reconstructing a Dictionary, Set et al instance, an unpickler must rehash (*).


(*) unless it verifies that all elements implement their own #= and #hash, which is intractable in practice; the only ways I can see of verifying that an object does not use #== or #identityHash in its #= and #hash methods are

a) to analyze the code (impossible for a non-AI unpickler) or,
b) construct a shallow copy of an object (since lots of #= implementations short-cut via "^self == other or: [...]") and simulate #= and #hash, to see if #== or #identityHash is sent

Either of these would slow down unpickling enormously; rehashing invokes #= and #hash anyway, but at full execution speed.
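The effect can be sketched in a workspace without a pickler: when an element's hash answer changes while it already sits in a HashedCollection (because #hash was redefined, or because an unpickled object received a fresh #identityHash), the element occupies a stale bucket until the collection is rehashed.

| s |
s := Set with: (1 to: 5).
"Suppose Interval>>#hash is redefined at this point, as this commit does.
s includes: (1 to: 5) may now answer false: probing starts from the new
hash, but the element still sits where the old hash placed it."
s rehash.
"After rehashing, s includes: (1 to: 5) answers true again."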


> Magma does, so no action should be needed for regular Sets and Dictionary's.

> Or will those be handled some other way?

> However, Magma has another special dictionary called a MagmaDictionary
> which is designed to be larger than RAM.  It maintains a reference to
> its 'session', which communicates with the server to access the large
> dictionary contents straight from the disk.  If I have Intervals in
> any of these, I'll have to rebuild them from scratch with utility
> scripts, because they don't support rehashing the way small, in-memory
> HashedCollections do.
>
>  - Chris



--
_,,,^..^,,,_
best, Eliot