Smalltalk › Pharo › Pharo Smalltalk Developers

IdentitySet but using #hash rather than #identityHash ?

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

62 messages Options

1234

Mariano Martinez Peck

IdentitySet but using #hash rather than #identityHash ?

Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values.

Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:

set := PluggableSet new.
set hashBlock: [ :elem | elem hash ].
set equalBlock: [ :a :b | a == b ].

But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..

Anyway, my question is, should that work? if not, what is the exact reason?

Thanks in advance,

--
Mariano
http://marianopeck.wordpress.com

Levente Uzonyi-2

Re: IdentitySet but using #hash rather than #identityHash ?

On Mon, 12 Dec 2011, Mariano Martinez Peck wrote:

> Hi guys. I hope this is not a very stupid question. Background: in Fuel we
> have a IdentityDictionary where we put each of the objects we find while
> traversing the graph. We need to use IdentitySet because we cannot have
> repetitions (and to avoid loops) so we NEED to use #==. In such dictionary
> we put ALL objects of the graph, so it can be very big. Since IdentitySet
> uses #identityHash, it means it will be using those ONLY 12 bits in the
> object header. It means that we have 2^12 = 4096 different values.
>
> Question: having explained the previous, I wanted to be able to use #hash
> rather than #identityHash since several classes implement #hash and hence I
> thought that using #hash I could have less colisions and hence a better
> performance. I tried to make a subclass of IdentitySet that uses #hash
> rather than #identityHash but my image freezes. I also tried something like:
>
> set := PluggableSet new.
> set hashBlock: [ :elem | elem hash ].
> set equalBlock: [ :a :b | a == b ].
>
> But it doesn't work either. I works with simple tests in a workspace but
> when I run the full tests of Fuel, it enters in a loop in the method #hash
> of Collection..
>
> Anyway, my question is, should that work? if not, what is the exact reason?

If the serialized objects remain unchanged, then it can work. You can also
give another IdentitySet implementation a try, e.g.:
http://leves.web.elte.hu/squeak/LargeIdentitySet.st .

Levente

>
> Thanks in advance,
>
> --
> Mariano
> http://marianopeck.wordpress.com
>

Levente Uzonyi-2

Re: IdentitySet but using #hash rather than #identityHash ?

On Mon, 12 Dec 2011, Levente Uzonyi wrote:

> On Mon, 12 Dec 2011, Mariano Martinez Peck wrote:
>
>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we
>> have a IdentityDictionary where we put each of the objects we find while
>> traversing the graph. We need to use IdentitySet because we cannot have
>> repetitions (and to avoid loops) so we NEED to use #==. In such dictionary
>> we put ALL objects of the graph, so it can be very big. Since IdentitySet
>> uses #identityHash, it means it will be using those ONLY 12 bits in the
>> object header. It means that we have 2^12 = 4096 different values.
>>
>> Question: having explained the previous, I wanted to be able to use #hash
>> rather than #identityHash since several classes implement #hash and hence I
>> thought that using #hash I could have less colisions and hence a better
>> performance. I tried to make a subclass of IdentitySet that uses #hash
>> rather than #identityHash but my image freezes. I also tried something
>> like:
>>
>> set := PluggableSet new.
>> set hashBlock: [ :elem | elem hash ].
>> set equalBlock: [ :a :b | a == b ].
>>
>> But it doesn't work either. I works with simple tests in a workspace but
>> when I run the full tests of Fuel, it enters in a loop in the method #hash
>> of Collection..
>>
>> Anyway, my question is, should that work? if not, what is the exact reason?
>
> If the serialized objects remain unchanged, then it can work. You can also
> give another IdentitySet implementation a try, e.g.:
> http://leves.web.elte.hu/squeak/LargeIdentitySet.st .

You can also take a look at SystemTracer, because (IIRC) it uses an
IdentitySet which will contain all objects at a point, and it's reasonably
fast. IIRC the idea is to use the identityHash of the object mixed with
the hash of its class to increase the number of bits in the key.

Levente

>
>
> Levente
>
>>
>> Thanks in advance,
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>

carlo.t

Re: IdentitySet but using #hash rather than #identityHash ?

In reply to this post by Mariano Martinez Peck

Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?

identityHash is deterministic in this case.

Does this help?

Cheers

Carlo

On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:

Mariano Martinez Peck

Re: IdentitySet but using #hash rather than #identityHash ?

On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:

Hi

Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?

Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash when sending #hash to its elements.
Sorry, but I couldn't undertand what is the cause of the problem? why it doesn't work while it does using #identityHash? could you elaborate?

thanks

identityHash is deterministic in this case.

Does this help?
Cheers
Carlo

On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:

Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values.

Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:

set := PluggableSet new.
set hashBlock: [ :elem | elem hash ].
set equalBlock: [ :a :b | a == b ].

But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..

Anyway, my question is, should that work? if not, what is the exact reason?

Thanks in advance,

--
Mariano
http://marianopeck.wordpress.com

--
Mariano
http://marianopeck.wordpress.com

Mariano Martinez Peck

Re: IdentitySet but using #hash rather than #identityHash ?

In reply to this post by Levente Uzonyi-2

Thanks Levente. I will try both. Actually, I would also need a fast (or at least I would like to use #hash) IdentityDictionary apart from a IdentitySet ;)

On Mon, Dec 12, 2011 at 1:52 PM, Levente Uzonyi <[hidden email]> wrote:

On Mon, 12 Dec 2011, Levente Uzonyi wrote:

On Mon, 12 Dec 2011, Mariano Martinez Peck wrote:

Hi guys. I hope this is not a very stupid question. Background: in Fuel we
have a IdentityDictionary where we put each of the objects we find while
traversing the graph. We need to use IdentitySet because we cannot have
repetitions (and to avoid loops) so we NEED to use #==. In such dictionary
we put ALL objects of the graph, so it can be very big. Since IdentitySet
uses #identityHash, it means it will be using those ONLY 12 bits in the
object header. It means that we have 2^12 = 4096 different values.

Question: having explained the previous, I wanted to be able to use #hash
rather than #identityHash since several classes implement #hash and hence I
thought that using #hash I could have less colisions and hence a better
performance. I tried to make a subclass of IdentitySet that uses #hash
rather than #identityHash but my image freezes. I also tried something like:

set := PluggableSet new.
set hashBlock: [ :elem | elem hash ].
set equalBlock: [ :a :b | a == b ].

But it doesn't work either. I works with simple tests in a workspace but
when I run the full tests of Fuel, it enters in a loop in the method #hash
of Collection..

Anyway, my question is, should that work? if not, what is the exact reason?

If the serialized objects remain unchanged, then it can work. You can also give another IdentitySet implementation a try, e.g.:
http://leves.web.elte.hu/squeak/LargeIdentitySet.st .

You can also take a look at SystemTracer, because (IIRC) it uses an IdentitySet which will contain all objects at a point, and it's reasonably fast. IIRC the idea is to use the identityHash of the object mixed with the hash of its class to increase the number of bits in the key.

Levente

Levente

Thanks in advance,

--
Mariano
http://marianopeck.wordpress.com

--
Mariano
http://marianopeck.wordpress.com

Mariano Martinez Peck

Re: IdentitySet but using #hash rather than #identityHash ?

In reply to this post by Mariano Martinez Peck

On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:

On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:

Hi

Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?

Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash when sending #hash to its elements.
Sorry, but I couldn't undertand what is the cause of the problem? why it doesn't work while it does using #identityHash? could you elaborate?

Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?

thanks

identityHash is deterministic in this case.

Does this help?
Cheers
Carlo

On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:

Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values.

Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:

set := PluggableSet new.
set hashBlock: [ :elem | elem hash ].
set equalBlock: [ :a :b | a == b ].

But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..

Anyway, my question is, should that work? if not, what is the exact reason?

Thanks in advance,

--
Mariano
http://marianopeck.wordpress.com

--
Mariano
http://marianopeck.wordpress.com

--
Mariano
http://marianopeck.wordpress.com

carlo.t

Re: IdentitySet but using #hash rather than #identityHash ?

Hi Mariano

I'm no expert either ;)

Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.

Do you have anymore info or perhaps which methods you changed on IdentitySet?

Cheers

Carlo

On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:

On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:

On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:

Hi

Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?

Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash when sending #hash to its elements.
Sorry, but I couldn't undertand what is the cause of the problem? why it doesn't work while it does using #identityHash? could you elaborate?

Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?

thanks

identityHash is deterministic in this case.

Does this help?
Cheers
Carlo

On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:

Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values.

Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:

set := PluggableSet new.
set hashBlock: [ :elem | elem hash ].
set equalBlock: [ :a :b | a == b ].

But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..

Anyway, my question is, should that work? if not, what is the exact reason?

Thanks in advance,

--
Mariano
http://marianopeck.wordpress.com

--
Mariano
http://marianopeck.wordpress.com

--
Mariano
http://marianopeck.wordpress.com

Michael Roberts-2

Re: IdentitySet but using #hash rather than #identityHash ?

Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.

However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space?

IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do.

As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite....

Cheers
Mike

On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote:
> Hi Mariano
> I'm no expert either ;)
> Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
> Do you have anymore info or perhaps which methods you changed on IdentitySet?
> Cheers
> Carlo
> On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:
>
>
> On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
>>
>>
>> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
>>>
>>> Hi
>>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?
>>
>> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash when sending #hash to its elements.
>> Sorry, but I couldn't undertand what is the cause of the problem? why it doesn't work while it does using #identityHash? could you elaborate?
>>
>
> Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
>
>>
>> thanks
>>
>>>
>>> identityHash is deterministic in this case.
>>> Does this help?
>>> Cheers
>>> Carlo
>>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:
>>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values.
>>>
>>> Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:
>>>
>>> set := PluggableSet new.
>>> set hashBlock: [ :elem | elem hash ].
>>> set equalBlock: [ :a :b | a == b ].
>>>
>>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..
>>>
>>> Anyway, my question is, should that work? if not, what is the exact reason?
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>>

>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>

>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>

Mariano Martinez Peck

Re: IdentitySet but using #hash rather than #identityHash ?

On Tue, Dec 13, 2011 at 8:43 AM, Michael Roberts <[hidden email]> wrote:

Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.

No, I cannot use a Set because I cannot have repetitions. This that I am asking is what we do in Fuel serializer while traversing the object graph. Each analyzed object is put in an IdentityDictionary (but the question is the same for IdentitySet) as key. To avoid cycles, I need to put each object only once. Since graphs can be very big, such dict could have lots of objects.

However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space?

Yes, exactly. I were thinking if there could be a way (maybe...that's why I am asking) of improve its performance considering that I could have much more objects than 2^13. In other words, I wanted to see if I could avoid colisions.

IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do.

Ok...it is a tradeoff. If I use #identityHash it is fast because there is no delegation and it is almost an immediate primitive. But I gues it will be slow if there are lots of colisions. Not using #identityHash but something else maybe could decrease maybe the amount of colisions, but maybe with the delegation it will gets slower.

I will try with Levente idea of what it is done in SystemTracer: use as a hash the identityHash of the object mixed with the identityHash of its class. Maybe that decreases the colisions and at the same time I don't pay delegation (#class is special bytecode, so nothing, and ok..there are 2 sends to #identityhash but I don't thinnk it is that much).

Anyway, thanks for the interesting post, I always learn :)

As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite....

yes...

Cheers
Mike

On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote:
> Hi Mariano
> I'm no expert either ;)
> Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
> Do you have anymore info or perhaps which methods you changed on IdentitySet?
> Cheers
> Carlo
> On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:
>
>
> On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
>>
>>
>> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
>>>
>>> Hi
>>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?
>>
>> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash when sending #hash to its elements.
>> Sorry, but I couldn't undertand what is the cause of the problem? why it doesn't work while it does using #identityHash? could you elaborate?
>>
>
> Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
>
>>
>> thanks
>>
>>>
>>> identityHash is deterministic in this case.
>>> Does this help?
>>> Cheers
>>> Carlo
>>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:
>>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values.
>>>
>>> Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:
>>>
>>> set := PluggableSet new.
>>> set hashBlock: [ :elem | elem hash ].
>>> set equalBlock: [ :a :b | a == b ].
>>>
>>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..
>>>
>>> Anyway, my question is, should that work? if not, what is the exact reason?
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>>
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>

--
Mariano
http://marianopeck.wordpress.com

Michael Roberts-2

Re: IdentitySet but using #hash rather than #identityHash ?

Ok I see. Have you measured the collisions on a very big graph? What stats do you get?

Cheers Mike

On 13 Dec 2011, at 08:37, Mariano Martinez Peck <[hidden email]> wrote:

On Tue, Dec 13, 2011 at 8:43 AM, Michael Roberts <[hidden email]> wrote:
Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.

No, I cannot use a Set because I cannot have repetitions. This that I am asking is what we do in Fuel serializer while traversing the object graph. Each analyzed object is put in an IdentityDictionary (but the question is the same for IdentitySet) as key. To avoid cycles, I need to put each object only once. Since graphs can be very big, such dict could have lots of objects.

However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space?

Yes, exactly. I were thinking if there could be a way (maybe...that's why I am asking) of improve its performance considering that I could have much more objects than 2^13. In other words, I wanted to see if I could avoid colisions.

IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do.

Ok...it is a tradeoff. If I use #identityHash it is fast because there is no delegation and it is almost an immediate primitive. But I gues it will be slow if there are lots of colisions. Not using #identityHash but something else maybe could decrease maybe the amount of colisions, but maybe with the delegation it will gets slower.

I will try with Levente idea of what it is done in SystemTracer: use as a hash the identityHash of the object mixed with the identityHash of its class. Maybe that decreases the colisions and at the same time I don't pay delegation (#class is special bytecode, so nothing, and ok..there are 2 sends to #identityhash but I don't thinnk it is that much).

Anyway, thanks for the interesting post, I always learn :)

As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite....

yes...

Cheers
Mike

On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote:
> Hi Mariano
> I'm no expert either ;)
> Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
> Do you have anymore info or perhaps which methods you changed on IdentitySet?
> Cheers
> Carlo
> On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:
>
>
> On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
>>
>>
>> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
>>>
>>> Hi
>>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?
>>
>> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash when sending #hash to its elements.
>> Sorry, but I couldn't undertand what is the cause of the problem? why it doesn't work while it does using #identityHash? could you elaborate?
>>
>
> Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
>
>>
>> thanks
>>
>>>
>>> identityHash is deterministic in this case.
>>> Does this help?
>>> Cheers
>>> Carlo
>>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:
>>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values.
>>>
>>> Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:
>>>
>>> set := PluggableSet new.
>>> set hashBlock: [ :elem | elem hash ].
>>> set equalBlock: [ :a :b | a == b ].
>>>
>>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..
>>>
>>> Anyway, my question is, should that work? if not, what is the exact reason?
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>>
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>

--
Mariano
http://marianopeck.wordpress.com

Eliot Miranda-2

Re: IdentitySet but using #hash rather than #identityHash ?

In reply to this post by Mariano Martinez Peck

Hi Mariano,

On Tue, Dec 13, 2011 at 12:37 AM, Mariano Martinez Peck <[hidden email]> wrote:

On Tue, Dec 13, 2011 at 8:43 AM, Michael Roberts <[hidden email]> wrote:
Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.

No, I cannot use a Set because I cannot have repetitions. This that I am asking is what we do in Fuel serializer while traversing the object graph. Each analyzed object is put in an IdentityDictionary (but the question is the same for IdentitySet) as key. To avoid cycles, I need to put each object only once. Since graphs can be very big, such dict could have lots of objects.

However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space?

Yes, exactly. I were thinking if there could be a way (maybe...that's why I am asking) of improve its performance considering that I could have much more objects than 2^13. In other words, I wanted to see if I could avoid colisions.

You can assume that certain properties of objects will not change during serialization, for example the class of objects, the basic size of objects. So you can construct a valid extended identity hash from these properties. For example

fuelSerializationHash

^self identityHash + (self class identityHash bitShift: 12) + (self basicSize bitShift: 24)

In general this idea may not work because of meta-primitives like changeClassTo:, which would change the hash. But in Fuel's case I think it's safe to assume that objects won't change class or size during serialization.

HTH

IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do.

Ok...it is a tradeoff. If I use #identityHash it is fast because there is no delegation and it is almost an immediate primitive. But I gues it will be slow if there are lots of colisions. Not using #identityHash but something else maybe could decrease maybe the amount of colisions, but maybe with the delegation it will gets slower.

I will try with Levente idea of what it is done in SystemTracer: use as a hash the identityHash of the object mixed with the identityHash of its class. Maybe that decreases the colisions and at the same time I don't pay delegation (#class is special bytecode, so nothing, and ok..there are 2 sends to #identityhash but I don't thinnk it is that much).

Anyway, thanks for the interesting post, I always learn :)

As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite....

yes...

Cheers
Mike

On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote:
> Hi Mariano
> I'm no expert either ;)
> Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
> Do you have anymore info or perhaps which methods you changed on IdentitySet?
> Cheers
> Carlo
> On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:
>
>
> On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
>>
>>
>> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
>>>
>>> Hi
>>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?
>>
>> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash when sending #hash to its elements.
>> Sorry, but I couldn't undertand what is the cause of the problem? why it doesn't work while it does using #identityHash? could you elaborate?
>>
>
> Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
>
>>
>> thanks
>>
>>>
>>> identityHash is deterministic in this case.
>>> Does this help?
>>> Cheers
>>> Carlo
>>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:
>>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values.
>>>
>>> Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:
>>>
>>> set := PluggableSet new.
>>> set hashBlock: [ :elem | elem hash ].
>>> set equalBlock: [ :a :b | a == b ].
>>>
>>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..
>>>
>>> Anyway, my question is, should that work? if not, what is the exact reason?
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>>
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>

--
Mariano
http://marianopeck.wordpress.com

--
best,

Eliot

Mariano Martinez Peck

Re: IdentitySet but using #hash rather than #identityHash ?

On Tue, Dec 13, 2011 at 8:44 PM, Eliot Miranda <[hidden email]> wrote:

Hi Mariano,

On Tue, Dec 13, 2011 at 12:37 AM, Mariano Martinez Peck <[hidden email]> wrote:

On Tue, Dec 13, 2011 at 8:43 AM, Michael Roberts <[hidden email]> wrote:
Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.

No, I cannot use a Set because I cannot have repetitions. This that I am asking is what we do in Fuel serializer while traversing the object graph. Each analyzed object is put in an IdentityDictionary (but the question is the same for IdentitySet) as key. To avoid cycles, I need to put each object only once. Since graphs can be very big, such dict could have lots of objects.

However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space?

Yes, exactly. I were thinking if there could be a way (maybe...that's why I am asking) of improve its performance considering that I could have much more objects than 2^13. In other words, I wanted to see if I could avoid colisions.

You can assume that certain properties of objects will not change during serialization, for example the class of objects, the basic size of objects. So you can construct a valid extended identity hash from these properties. For example

fuelSerializationHash
^self identityHash + (self class identityHash bitShift: 12) + (self basicSize bitShift: 24)

Thanks!!! that's exactly what I wanted to try :)

In general this idea may not work because of meta-primitives like changeClassTo:, which would change the hash. But in Fuel's case I think it's safe to assume that objects won't change class or size during serialization.

Exactly :) Moreoever, if objects change their state or shape, or, in other words, in the graph changes while we are in the middle of the serialization, we will be screw anyway....whether the hash has changed or not :)

I will see what is worst, if the colisions or having to send 2 times #identityHash and 1 #basicSize.

Thanks!

HTH

IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do.

Ok...it is a tradeoff. If I use #identityHash it is fast because there is no delegation and it is almost an immediate primitive. But I gues it will be slow if there are lots of colisions. Not using #identityHash but something else maybe could decrease maybe the amount of colisions, but maybe with the delegation it will gets slower.

I will try with Levente idea of what it is done in SystemTracer: use as a hash the identityHash of the object mixed with the identityHash of its class. Maybe that decreases the colisions and at the same time I don't pay delegation (#class is special bytecode, so nothing, and ok..there are 2 sends to #identityhash but I don't thinnk it is that much).

Anyway, thanks for the interesting post, I always learn :)

As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite....

yes...

Cheers
Mike

On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote:
> Hi Mariano
> I'm no expert either ;)
> Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
> Do you have anymore info or perhaps which methods you changed on IdentitySet?
> Cheers
> Carlo
> On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:
>
>
> On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
>>
>>
>> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
>>>
>>> Hi
>>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?
>>
>> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash when sending #hash to its elements.
>> Sorry, but I couldn't undertand what is the cause of the problem? why it doesn't work while it does using #identityHash? could you elaborate?
>>
>
> Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
>
>>
>> thanks
>>
>>>
>>> identityHash is deterministic in this case.
>>> Does this help?
>>> Cheers
>>> Carlo
>>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:
>>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values.
>>>
>>> Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:
>>>
>>> set := PluggableSet new.
>>> set hashBlock: [ :elem | elem hash ].
>>> set equalBlock: [ :a :b | a == b ].
>>>
>>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..
>>>
>>> Anyway, my question is, should that work? if not, what is the exact reason?
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>>
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>

--
Mariano
http://marianopeck.wordpress.com

--
best,
Eliot

--
Mariano
http://marianopeck.wordpress.com

Mariano Martinez Peck

Re: IdentitySet but using #hash rather than #identityHash ?

Well...it is much slower :( it seems that the cost of (aKey identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey basicSize bitShift: 24)
is bigger than the colisions.
Anyway, thanks for the nice thread. I learned.

Cheers

On Tue, Dec 13, 2011 at 8:51 PM, Mariano Martinez Peck <[hidden email]> wrote:

On Tue, Dec 13, 2011 at 8:44 PM, Eliot Miranda <[hidden email]> wrote:

Hi Mariano,

On Tue, Dec 13, 2011 at 12:37 AM, Mariano Martinez Peck <[hidden email]> wrote:

On Tue, Dec 13, 2011 at 8:43 AM, Michael Roberts <[hidden email]> wrote:
Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.

No, I cannot use a Set because I cannot have repetitions. This that I am asking is what we do in Fuel serializer while traversing the object graph. Each analyzed object is put in an IdentityDictionary (but the question is the same for IdentitySet) as key. To avoid cycles, I need to put each object only once. Since graphs can be very big, such dict could have lots of objects.

However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space?

Yes, exactly. I were thinking if there could be a way (maybe...that's why I am asking) of improve its performance considering that I could have much more objects than 2^13. In other words, I wanted to see if I could avoid colisions.

You can assume that certain properties of objects will not change during serialization, for example the class of objects, the basic size of objects. So you can construct a valid extended identity hash from these properties. For example

fuelSerializationHash
^self identityHash + (self class identityHash bitShift: 12) + (self basicSize bitShift: 24)

Thanks!!! that's exactly what I wanted to try :)

In general this idea may not work because of meta-primitives like changeClassTo:, which would change the hash. But in Fuel's case I think it's safe to assume that objects won't change class or size during serialization.

Exactly :) Moreoever, if objects change their state or shape, or, in other words, in the graph changes while we are in the middle of the serialization, we will be screw anyway....whether the hash has changed or not :)

I will see what is worst, if the colisions or having to send 2 times #identityHash and 1 #basicSize.

Thanks!

HTH

IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do.

Ok...it is a tradeoff. If I use #identityHash it is fast because there is no delegation and it is almost an immediate primitive. But I gues it will be slow if there are lots of colisions. Not using #identityHash but something else maybe could decrease maybe the amount of colisions, but maybe with the delegation it will gets slower.

I will try with Levente idea of what it is done in SystemTracer: use as a hash the identityHash of the object mixed with the identityHash of its class. Maybe that decreases the colisions and at the same time I don't pay delegation (#class is special bytecode, so nothing, and ok..there are 2 sends to #identityhash but I don't thinnk it is that much).

Anyway, thanks for the interesting post, I always learn :)

As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite....

yes...

Cheers
Mike

On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote:
> Hi Mariano
> I'm no expert either ;)
> Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
> Do you have anymore info or perhaps which methods you changed on IdentitySet?
> Cheers
> Carlo
> On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:
>
>
> On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
>>
>>
>> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
>>>
>>> Hi
>>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?
>>
>> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash when sending #hash to its elements.
>> Sorry, but I couldn't undertand what is the cause of the problem? why it doesn't work while it does using #identityHash? could you elaborate?
>>
>
> Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
>
>>
>> thanks
>>
>>>
>>> identityHash is deterministic in this case.
>>> Does this help?
>>> Cheers
>>> Carlo
>>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:
>>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096 different values.
>>>
>>> Question: having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:
>>>
>>> set := PluggableSet new.
>>> set hashBlock: [ :elem | elem hash ].
>>> set equalBlock: [ :a :b | a == b ].
>>>
>>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..
>>>
>>> Anyway, my question is, should that work? if not, what is the exact reason?
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>>
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>

--
Mariano
http://marianopeck.wordpress.com

--
best,
Eliot

--
Mariano
http://marianopeck.wordpress.com

--
Mariano
http://marianopeck.wordpress.com

Guido Stepken

Re: IdentitySet but using #hash rather than #identityHash ?

Collision free hashing based on fibonacci pseudo primes? 25 years old ... :-)

On Dec 15, 2011 11:35 PM, "Mariano Martinez Peck" <[hidden email]> wrote:

Well...it is much slower :( it seems that the cost of (aKey identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey basicSize bitShift: 24)
is bigger than the colisions.
Anyway, thanks for the nice thread. I learned.

Cheers

On Tue, Dec 13, 2011 at 8:51 PM, Mariano Martinez Peck <[hidden email]> wrote:
>
>
>
> On T...
--
Mariano
http://marianopeck.wordpress.com

Henrik Sperre Johansen

Re: IdentitySet but using #hash rather than #identityHash ?

In reply to this post by Mariano Martinez Peck

On 15.12.2011 23:35, Mariano Martinez Peck wrote:
> Well...it is much slower :( it seems that the cost of (aKey
> identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey
> basicSize bitShift: 24)
> is bigger than the colisions.
> Anyway, thanks for the nice thread. I learned.
>
> Cheers
>
Well, first of all, that always creates a largeInteger, since
identityHash is already shifted by 22... You'd want to be using
basicIdentityHash.
2nd off, basicSize is useless for non-variable classes, which I guess is
the common case.
3rd, putting class hash in the high bits won't really help much if
you're serializing many instances of the same class, as they will all
occupy a sequential hash range.

So if I were you, I'd try with something like
obj basicIdentityHash << 12 + (obj class basicIdentityHash)
before discarding it outright due to computation cost.

Cheers,
Henry

Henrik Sperre Johansen

Re: IdentitySet but using #hash rather than #identityHash ?

In reply to this post by Mariano Martinez Peck

On 16.12.2011 00:40, Henrik Sperre Johansen wrote:

> On 15.12.2011 23:35, Mariano Martinez Peck wrote:
>> Well...it is much slower :( it seems that the cost of (aKey
>> identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey
>> basicSize bitShift: 24)
>> is bigger than the colisions.
>> Anyway, thanks for the nice thread. I learned.
>>
>> Cheers
>>
> Well, first of all, that always creates a largeInteger, since
> identityHash is already shifted by 22... You'd want to be using
> basicIdentityHash.
> 2nd off, basicSize is useless for non-variable classes, which I guess
> is the common case.
> 3rd, putting class hash in the high bits won't really help much if
> you're serializing many instances of the same class, as they will all
> occupy a sequential hash range.
>
> So if I were you, I'd try with something like
> obj basicIdentityHash << 12 + (obj class basicIdentityHash)
> before discarding it outright due to computation cost.
>
> Cheers,
> Henry
>

Note: This is not portable to Squeak.
They kept the old version of identityHash (renamed to basicIdentityHash
in Pharo), and introduced scaledIdentityHash (doing the same as
identityHash in Pharo), due to backwards compatability concerns for
users explicitly relying on #identityHash returning a value in range [0,
4095].

Cheers,
Henry

Henrik Sperre Johansen

Re: IdentitySet but using #hash rather than #identityHash ?

In reply to this post by Mariano Martinez Peck

On 16.12.2011 00:40, Henrik Sperre Johansen wrote:

A few more points:
1) don't use PluggableSets. Block activations are costly, and you REALLY
need specialized data/hash functions for it to be worth using.
2) Even with a subclass of IdentitySet, it's hardly worth it:

PluggableSet:
n := 1000000.
set := PluggableSet new: n.
set hashBlock: [:obj | obj basicIdentityHash << 12 + (obj class
basicIdentityHash) ].
set equalBlock: [ :a :b | a == b ].
Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 18828

The one I suggested in a custom subclass:
n := 1000000.
set := PseudoIdentitySet new: n.
Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]]
14182

Usual identitySet:
n := 1000000.
set := IdentitySet new: n.
Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 4595

Custom subclass with a slightly modified hash (anObject identityHash +
(anObject class basicIdentityHash bitShift: 6)):
n := 1000000.
set := PseudoIdentitySet2 new: n.
Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 3694
Why 6 you ask? Because I'm worth it.

Not really enough to justify the extra compexity, imho.

Cheers,
Henry

Henrik Sperre Johansen

Re: IdentitySet but using #hash rather than #identityHash ?

In reply to this post by Mariano Martinez Peck

On 16.12.2011 01:06, Henrik Sperre Johansen wrote:

>
> Usual identitySet:
> n := 1000000.
> set := IdentitySet new: n.
> Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]]
> 4595
>
> Custom subclass with a slightly modified hash (anObject identityHash +
> (anObject class basicIdentityHash bitShift: 6)):
> n := 1000000.
> set := PseudoIdentitySet2 new: n.
> Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]]
> 3694
> Why 6 you ask? Because I'm worth it.
>
> Not really enough to justify the extra compexity, imho.
>
> Cheers,
> Henry
>

To put that a bit more in perspective, the extra cost for not
preallocating to a sane value:
n := 1000000.
set := IdentitySet new.
Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 6872

Cheers,
Henry

Levente Uzonyi-2

Re: IdentitySet but using #hash rather than #identityHash ?

In reply to this post by Henrik Sperre Johansen

On Fri, 16 Dec 2011, Henrik Sperre Johansen wrote:

> On 16.12.2011 00:40, Henrik Sperre Johansen wrote:
>> On 15.12.2011 23:35, Mariano Martinez Peck wrote:
>>> Well...it is much slower :( it seems that the cost of (aKey identityHash
>>> + ( aKey mareaClass identityHash bitShift: 12) + (aKey basicSize bitShift:
>>> 24)
>>> is bigger than the colisions.
>>> Anyway, thanks for the nice thread. I learned.
>>>
>>> Cheers
>>>
>> Well, first of all, that always creates a largeInteger, since identityHash
>> is already shifted by 22... You'd want to be using basicIdentityHash.
>> 2nd off, basicSize is useless for non-variable classes, which I guess is
>> the common case.
>> 3rd, putting class hash in the high bits won't really help much if you're
>> serializing many instances of the same class, as they will all occupy a
>> sequential hash range.
>>
>> So if I were you, I'd try with something like
>> obj basicIdentityHash << 12 + (obj class basicIdentityHash)
>> before discarding it outright due to computation cost.
>>
>> Cheers,
>> Henry
>>
> A few more points:
> 1) don't use PluggableSets. Block activations are costly, and you REALLY need
> specialized data/hash functions for it to be worth using.
> 2) Even with a subclass of IdentitySet, it's hardly worth it:
>
> PluggableSet:
> n := 1000000.
> set := PluggableSet new: n.
> set hashBlock: [:obj | obj basicIdentityHash << 12 + (obj class
> basicIdentityHash) ].
> set equalBlock: [ :a :b | a == b ].
> Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 18828
>
> The one I suggested in a custom subclass:
> n := 1000000.
> set := PseudoIdentitySet new: n.
> Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 14182
>
> Usual identitySet:
> n := 1000000.
> set := IdentitySet new: n.
> Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 4595
>
> Custom subclass with a slightly modified hash (anObject identityHash +
> (anObject class basicIdentityHash bitShift: 6)):
> n := 1000000.
> set := PseudoIdentitySet2 new: n.
> Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 3694
> Why 6 you ask? Because I'm worth it.
>
> Not really enough to justify the extra compexity, imho.

How about my numbers? :)

"Preallocate objects, so we won't count gc time."
n := 1000000.
objects := Array new: n streamContents: [ :stream |
n timesRepeat: [ stream nextPut: Object new ] ].

set := IdentitySet new: n.
Smalltalk garbageCollect.
[1 to: n do: [ :i | set add: (objects at: i) ] ] timeToRun. "4949"

set := LargeIdentitySet new.
Smalltalk garbageCollect.
[1 to: n do: [ :i | set add: (objects at: i) ] ] timeToRun. "331"

set := (PluggableSet new: n)
hashBlock: [ :object | object identityHash * 4096 + object class identityHash * 64 ]; "Change this to #basicIdentityHash in Pharo"
equalBlock: [ :a :b | a == b ];
yourself.
Smalltalk garbageCollect.
[1 to: n do: [ :i | set add: (objects at: i) ] ] timeToRun. "5511"

I also have a LargeIdentityDictionary, which is relatively fast, but not
as fast as LargeIdentitySet, because (for some unknown reason) we don't
have a primitive that could support it. If we had a primitive like
primitive 132 which would return the index of the element if found or 0
if not, then we could have a really fast LargeIdentityDictionary.

Levente

>
> Cheers,
> Henry
>
>

1234