Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values.
Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like: set := PluggableSet new. set hashBlock: [ :elem | elem hash ]. set equalBlock: [ :a :b | a == b ]. But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection.. Anyway, my question is, should that work? if not, what is the exact reason? Thanks in advance, -- Mariano http://marianopeck.wordpress.com |
On Mon, 12 Dec 2011, Mariano Martinez Peck wrote:
> Hi guys. I hope this is not a very stupid question. Background: in Fuel we > have a IdentityDictionary where we put each of the objects we find while > traversing the graph. We need to use IdentitySet because we cannot have > repetitions (and to avoid loops) so we NEED to use #==. In such dictionary > we put ALL objects of the graph, so it can be very big. Since IdentitySet > uses #identityHash, it means it will be using those ONLY 12 bits in the > object header. It means that we have 2^12 = 4096 different values. > > Question: having explained the previous, I wanted to be able to use #hash > rather than #identityHash since several classes implement #hash and hence I > thought that using #hash I could have less colisions and hence a better > performance. I tried to make a subclass of IdentitySet that uses #hash > rather than #identityHash but my image freezes. I also tried something like: > > set := PluggableSet new. > set hashBlock: [ :elem | elem hash ]. > set equalBlock: [ :a :b | a == b ]. > > But it doesn't work either. I works with simple tests in a workspace but > when I run the full tests of Fuel, it enters in a loop in the method #hash > of Collection.. > > Anyway, my question is, should that work? if not, what is the exact reason? If the serialized objects remain unchanged, then it can work. You can also give another IdentitySet implementation a try, e.g.: http://leves.web.elte.hu/squeak/LargeIdentitySet.st . Levente > > Thanks in advance, > > -- > Mariano > http://marianopeck.wordpress.com > |
On Mon, 12 Dec 2011, Levente Uzonyi wrote:
> On Mon, 12 Dec 2011, Mariano Martinez Peck wrote: > >> Hi guys. I hope this is not a very stupid question. Background: in Fuel we >> have a IdentityDictionary where we put each of the objects we find while >> traversing the graph. We need to use IdentitySet because we cannot have >> repetitions (and to avoid loops) so we NEED to use #==. In such dictionary >> we put ALL objects of the graph, so it can be very big. Since IdentitySet >> uses #identityHash, it means it will be using those ONLY 12 bits in the >> object header. It means that we have 2^12 = 4096 different values. >> >> Question: having explained the previous, I wanted to be able to use #hash >> rather than #identityHash since several classes implement #hash and hence I >> thought that using #hash I could have less colisions and hence a better >> performance. I tried to make a subclass of IdentitySet that uses #hash >> rather than #identityHash but my image freezes. I also tried something >> like: >> >> set := PluggableSet new. >> set hashBlock: [ :elem | elem hash ]. >> set equalBlock: [ :a :b | a == b ]. >> >> But it doesn't work either. I works with simple tests in a workspace but >> when I run the full tests of Fuel, it enters in a loop in the method #hash >> of Collection.. >> >> Anyway, my question is, should that work? if not, what is the exact reason? > > If the serialized objects remain unchanged, then it can work. You can also > give another IdentitySet implementation a try, e.g.: > http://leves.web.elte.hu/squeak/LargeIdentitySet.st . You can also take a look at SystemTracer, because (IIRC) it uses an IdentitySet which will contain all objects at a point, and it's reasonably fast. IIRC the idea is to use the identityHash of the object mixed with the hash of its class to increase the number of bits in the key. Levente > > > Levente > >> >> Thanks in advance, >> >> -- >> Mariano >> http://marianopeck.wordpress.com >> > > |
In reply to this post by Mariano Martinez Peck
Hi
Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again? identityHash is deterministic in this case. Does this help? Cheers Carlo On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote: Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values. Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like: set := PluggableSet new. set hashBlock: [ :elem | elem hash ]. set equalBlock: [ :a :b | a == b ]. But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection.. Anyway, my question is, should that work? if not, what is the exact reason? Thanks in advance, -- Mariano http://marianopeck.wordpress.com |
On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash when sending #hash to its elements. Sorry, but I couldn't undertand what is the cause of the problem? why it doesn't work while it does using #identityHash? could you elaborate? thanks
-- Mariano http://marianopeck.wordpress.com |
In reply to this post by Levente Uzonyi-2
Thanks Levente. I will try both. Actually, I would also need a fast (or at least I would like to use #hash) IdentityDictionary apart from a IdentitySet ;)
On Mon, Dec 12, 2011 at 1:52 PM, Levente Uzonyi <[hidden email]> wrote:
-- Mariano http://marianopeck.wordpress.com |
In reply to this post by Mariano Martinez Peck
On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
-- Mariano http://marianopeck.wordpress.com |
Hi Mariano I'm no expert either ;) Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'. Do you have anymore info or perhaps which methods you changed on IdentitySet? Cheers Carlo On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote: On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
-- Mariano http://marianopeck.wordpress.com |
Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.
However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space? IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do. As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite.... Cheers Mike On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote: > Hi Mariano > I'm no expert either ;) > Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'. > Do you have anymore info or perhaps which methods you changed on IdentitySet? > Cheers > Carlo > On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote: > > > On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote: >> >> >> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote: >>> >>> Hi >>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again? >> >> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash when sending #hash to its elements. >> Sorry, but I couldn't undertand what is the cause of the problem? why it doesn't work while it does using #identityHash? could you elaborate? >> > > Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there? > >> >> thanks >> >>> >>> identityHash is deterministic in this case. >>> Does this help? >>> Cheers >>> Carlo >>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote: >>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values. >>> >>> Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like: >>> >>> set := PluggableSet new. >>> set hashBlock: [ :elem | elem hash ]. >>> set equalBlock: [ :a :b | a == b ]. >>> >>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection.. >>> >>> Anyway, my question is, should that work? if not, what is the exact reason? >>> >>> Thanks in advance, >>> >>> -- >>> Mariano >>> http://marianopeck.wordpress.com >>> > > -- > Mariano > http://marianopeck.wordpress.com > > > |
On Tue, Dec 13, 2011 at 8:43 AM, Michael Roberts <[hidden email]> wrote: Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set. No, I cannot use a Set because I cannot have repetitions. This that I am asking is what we do in Fuel serializer while traversing the object graph. Each analyzed object is put in an IdentityDictionary (but the question is the same for IdentitySet) as key. To avoid cycles, I need to put each object only once. Since graphs can be very big, such dict could have lots of objects. However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space? Yes, exactly. I were thinking if there could be a way (maybe...that's why I am asking) of improve its performance considering that I could have much more objects than 2^13. In other words, I wanted to see if I could avoid colisions.
Ok...it is a tradeoff. If I use #identityHash it is fast because there is no delegation and it is almost an immediate primitive. But I gues it will be slow if there are lots of colisions. Not using #identityHash but something else maybe could decrease maybe the amount of colisions, but maybe with the delegation it will gets slower. I will try with Levente idea of what it is done in SystemTracer: use as a hash the identityHash of the object mixed with the identityHash of its class. Maybe that decreases the colisions and at the same time I don't pay delegation (#class is special bytecode, so nothing, and ok..there are 2 sends to #identityhash but I don't thinnk it is that much). Anyway, thanks for the interesting post, I always learn :)
yes... Cheers -- Mariano http://marianopeck.wordpress.com |
Ok I see. Have you measured the collisions on a very big graph? What stats do you get?
|
In reply to this post by Mariano Martinez Peck
Hi Mariano,
On Tue, Dec 13, 2011 at 12:37 AM, Mariano Martinez Peck <[hidden email]> wrote:
You can assume that certain properties of objects will not change during serialization, for example the class of objects, the basic size of objects. So you can construct a valid extended identity hash from these properties. For example
fuelSerializationHash ^self identityHash + (self class identityHash bitShift: 12) + (self basicSize bitShift: 24) In general this idea may not work because of meta-primitives like changeClassTo:, which would change the hash. But in Fuel's case I think it's safe to assume that objects won't change class or size during serialization.
HTH
best, Eliot |
On Tue, Dec 13, 2011 at 8:44 PM, Eliot Miranda <[hidden email]> wrote: Hi Mariano, Thanks!!! that's exactly what I wanted to try :)
Exactly :) Moreoever, if objects change their state or shape, or, in other words, in the graph changes while we are in the middle of the serialization, we will be screw anyway....whether the hash has changed or not :) I will see what is worst, if the colisions or having to send 2 times #identityHash and 1 #basicSize. Thanks!
-- Mariano http://marianopeck.wordpress.com |
Well...it is much slower :( it seems that the cost of (aKey identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey basicSize bitShift: 24)
is bigger than the colisions. Anyway, thanks for the nice thread. I learned. Cheers On Tue, Dec 13, 2011 at 8:51 PM, Mariano Martinez Peck <[hidden email]> wrote:
-- Mariano http://marianopeck.wordpress.com |
Collision free hashing based on fibonacci pseudo primes? 25 years old ... :-) On Dec 15, 2011 11:35 PM, "Mariano Martinez Peck" <[hidden email]> wrote: |
In reply to this post by Mariano Martinez Peck
On 15.12.2011 23:35, Mariano Martinez Peck wrote:
> Well...it is much slower :( it seems that the cost of (aKey > identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey > basicSize bitShift: 24) > is bigger than the colisions. > Anyway, thanks for the nice thread. I learned. > > Cheers > Well, first of all, that always creates a largeInteger, since identityHash is already shifted by 22... You'd want to be using basicIdentityHash. 2nd off, basicSize is useless for non-variable classes, which I guess is the common case. 3rd, putting class hash in the high bits won't really help much if you're serializing many instances of the same class, as they will all occupy a sequential hash range. So if I were you, I'd try with something like obj basicIdentityHash << 12 + (obj class basicIdentityHash) before discarding it outright due to computation cost. Cheers, Henry |
In reply to this post by Mariano Martinez Peck
On 16.12.2011 00:40, Henrik Sperre Johansen wrote:
> On 15.12.2011 23:35, Mariano Martinez Peck wrote: >> Well...it is much slower :( it seems that the cost of (aKey >> identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey >> basicSize bitShift: 24) >> is bigger than the colisions. >> Anyway, thanks for the nice thread. I learned. >> >> Cheers >> > Well, first of all, that always creates a largeInteger, since > identityHash is already shifted by 22... You'd want to be using > basicIdentityHash. > 2nd off, basicSize is useless for non-variable classes, which I guess > is the common case. > 3rd, putting class hash in the high bits won't really help much if > you're serializing many instances of the same class, as they will all > occupy a sequential hash range. > > So if I were you, I'd try with something like > obj basicIdentityHash << 12 + (obj class basicIdentityHash) > before discarding it outright due to computation cost. > > Cheers, > Henry > They kept the old version of identityHash (renamed to basicIdentityHash in Pharo), and introduced scaledIdentityHash (doing the same as identityHash in Pharo), due to backwards compatability concerns for users explicitly relying on #identityHash returning a value in range [0, 4095]. Cheers, Henry |
In reply to this post by Mariano Martinez Peck
On 16.12.2011 00:40, Henrik Sperre Johansen wrote:
> On 15.12.2011 23:35, Mariano Martinez Peck wrote: >> Well...it is much slower :( it seems that the cost of (aKey >> identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey >> basicSize bitShift: 24) >> is bigger than the colisions. >> Anyway, thanks for the nice thread. I learned. >> >> Cheers >> > Well, first of all, that always creates a largeInteger, since > identityHash is already shifted by 22... You'd want to be using > basicIdentityHash. > 2nd off, basicSize is useless for non-variable classes, which I guess > is the common case. > 3rd, putting class hash in the high bits won't really help much if > you're serializing many instances of the same class, as they will all > occupy a sequential hash range. > > So if I were you, I'd try with something like > obj basicIdentityHash << 12 + (obj class basicIdentityHash) > before discarding it outright due to computation cost. > > Cheers, > Henry > 1) don't use PluggableSets. Block activations are costly, and you REALLY need specialized data/hash functions for it to be worth using. 2) Even with a subclass of IdentitySet, it's hardly worth it: PluggableSet: n := 1000000. set := PluggableSet new: n. set hashBlock: [:obj | obj basicIdentityHash << 12 + (obj class basicIdentityHash) ]. set equalBlock: [ :a :b | a == b ]. Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 18828 The one I suggested in a custom subclass: n := 1000000. set := PseudoIdentitySet new: n. Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 14182 Usual identitySet: n := 1000000. set := IdentitySet new: n. Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 4595 Custom subclass with a slightly modified hash (anObject identityHash + (anObject class basicIdentityHash bitShift: 6)): n := 1000000. set := PseudoIdentitySet2 new: n. Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 3694 Why 6 you ask? Because I'm worth it. Not really enough to justify the extra compexity, imho. Cheers, Henry |
In reply to this post by Mariano Martinez Peck
On 16.12.2011 01:06, Henrik Sperre Johansen wrote:
> > Usual identitySet: > n := 1000000. > set := IdentitySet new: n. > Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] > 4595 > > Custom subclass with a slightly modified hash (anObject identityHash + > (anObject class basicIdentityHash bitShift: 6)): > n := 1000000. > set := PseudoIdentitySet2 new: n. > Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] > 3694 > Why 6 you ask? Because I'm worth it. > > Not really enough to justify the extra compexity, imho. > > Cheers, > Henry > preallocating to a sane value: n := 1000000. set := IdentitySet new. Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 6872 Cheers, Henry |
In reply to this post by Henrik Sperre Johansen
On Fri, 16 Dec 2011, Henrik Sperre Johansen wrote:
> On 16.12.2011 00:40, Henrik Sperre Johansen wrote: >> On 15.12.2011 23:35, Mariano Martinez Peck wrote: >>> Well...it is much slower :( it seems that the cost of (aKey identityHash >>> + ( aKey mareaClass identityHash bitShift: 12) + (aKey basicSize bitShift: >>> 24) >>> is bigger than the colisions. >>> Anyway, thanks for the nice thread. I learned. >>> >>> Cheers >>> >> Well, first of all, that always creates a largeInteger, since identityHash >> is already shifted by 22... You'd want to be using basicIdentityHash. >> 2nd off, basicSize is useless for non-variable classes, which I guess is >> the common case. >> 3rd, putting class hash in the high bits won't really help much if you're >> serializing many instances of the same class, as they will all occupy a >> sequential hash range. >> >> So if I were you, I'd try with something like >> obj basicIdentityHash << 12 + (obj class basicIdentityHash) >> before discarding it outright due to computation cost. >> >> Cheers, >> Henry >> > A few more points: > 1) don't use PluggableSets. Block activations are costly, and you REALLY need > specialized data/hash functions for it to be worth using. > 2) Even with a subclass of IdentitySet, it's hardly worth it: > > PluggableSet: > n := 1000000. > set := PluggableSet new: n. > set hashBlock: [:obj | obj basicIdentityHash << 12 + (obj class > basicIdentityHash) ]. > set equalBlock: [ :a :b | a == b ]. > Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 18828 > > The one I suggested in a custom subclass: > n := 1000000. > set := PseudoIdentitySet new: n. > Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 14182 > > Usual identitySet: > n := 1000000. > set := IdentitySet new: n. > Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 4595 > > Custom subclass with a slightly modified hash (anObject identityHash + > (anObject class basicIdentityHash bitShift: 6)): > n := 1000000. > set := PseudoIdentitySet2 new: n. > Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 3694 > Why 6 you ask? Because I'm worth it. > > Not really enough to justify the extra compexity, imho. How about my numbers? :) "Preallocate objects, so we won't count gc time." n := 1000000. objects := Array new: n streamContents: [ :stream | n timesRepeat: [ stream nextPut: Object new ] ]. set := IdentitySet new: n. Smalltalk garbageCollect. [1 to: n do: [ :i | set add: (objects at: i) ] ] timeToRun. "4949" set := LargeIdentitySet new. Smalltalk garbageCollect. [1 to: n do: [ :i | set add: (objects at: i) ] ] timeToRun. "331" set := (PluggableSet new: n) hashBlock: [ :object | object identityHash * 4096 + object class identityHash * 64 ]; "Change this to #basicIdentityHash in Pharo" equalBlock: [ :a :b | a == b ]; yourself. Smalltalk garbageCollect. [1 to: n do: [ :i | set add: (objects at: i) ] ] timeToRun. "5511" I also have a LargeIdentityDictionary, which is relatively fast, but not as fast as LargeIdentitySet, because (for some unknown reason) we don't have a primitive that could support it. If we had a primitive like primitive 132 which would return the index of the element if found or 0 if not, then we could have a really fast LargeIdentityDictionary. Levente > > Cheers, > Henry > > |
Free forum by Nabble | Edit this page |