IdentitySet but using #hash rather than #identityHash ?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
62 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

IdentitySet but using #hash rather than #identityHash ?

Mariano Martinez Peck
Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096  different values. 

Question:  having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:

set := PluggableSet new.
    set hashBlock: [ :elem | elem hash ].
    set equalBlock: [ :a :b | a == b ].

But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..

Anyway, my question is, should that work? if not, what is the exact reason?

Thanks in advance,

--
Mariano
http://marianopeck.wordpress.com

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Levente Uzonyi-2
On Mon, 12 Dec 2011, Mariano Martinez Peck wrote:

> Hi guys. I hope this is not a very stupid question. Background: in Fuel we
> have a IdentityDictionary where we put each of the objects we find while
> traversing the graph. We need to use IdentitySet because we cannot have
> repetitions (and to avoid loops) so we NEED to use #==. In such dictionary
> we put ALL objects of the graph, so it can be very big. Since IdentitySet
> uses #identityHash, it means it will be using those ONLY 12 bits in the
> object header. It means that we have 2^12 = 4096  different values.
>
> Question:  having explained the previous, I wanted to be able to use #hash
> rather than #identityHash since several classes implement #hash and hence I
> thought that using #hash I could have less colisions and hence a better
> performance. I tried to make a subclass of IdentitySet that uses #hash
> rather than #identityHash but my image freezes. I also tried something like:
>
> set := PluggableSet new.
>    set hashBlock: [ :elem | elem hash ].
>    set equalBlock: [ :a :b | a == b ].
>
> But it doesn't work either. I works with simple tests in a workspace but
> when I run the full tests of Fuel, it enters in a loop in the method #hash
> of Collection..
>
> Anyway, my question is, should that work? if not, what is the exact reason?

If the serialized objects remain unchanged, then it can work. You can also
give another IdentitySet implementation a try, e.g.:
http://leves.web.elte.hu/squeak/LargeIdentitySet.st .


Levente

>
> Thanks in advance,
>
> --
> Mariano
> http://marianopeck.wordpress.com
>

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Levente Uzonyi-2
On Mon, 12 Dec 2011, Levente Uzonyi wrote:

> On Mon, 12 Dec 2011, Mariano Martinez Peck wrote:
>
>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we
>> have a IdentityDictionary where we put each of the objects we find while
>> traversing the graph. We need to use IdentitySet because we cannot have
>> repetitions (and to avoid loops) so we NEED to use #==. In such dictionary
>> we put ALL objects of the graph, so it can be very big. Since IdentitySet
>> uses #identityHash, it means it will be using those ONLY 12 bits in the
>> object header. It means that we have 2^12 = 4096  different values.
>>
>> Question:  having explained the previous, I wanted to be able to use #hash
>> rather than #identityHash since several classes implement #hash and hence I
>> thought that using #hash I could have less colisions and hence a better
>> performance. I tried to make a subclass of IdentitySet that uses #hash
>> rather than #identityHash but my image freezes. I also tried something
>> like:
>>
>> set := PluggableSet new.
>>    set hashBlock: [ :elem | elem hash ].
>>    set equalBlock: [ :a :b | a == b ].
>>
>> But it doesn't work either. I works with simple tests in a workspace but
>> when I run the full tests of Fuel, it enters in a loop in the method #hash
>> of Collection..
>>
>> Anyway, my question is, should that work? if not, what is the exact reason?
>
> If the serialized objects remain unchanged, then it can work. You can also
> give another IdentitySet implementation a try, e.g.:
> http://leves.web.elte.hu/squeak/LargeIdentitySet.st .

You can also take a look at SystemTracer, because (IIRC) it uses an
IdentitySet which will contain all objects at a point, and it's reasonably
fast. IIRC the idea is to use the identityHash of the object mixed with
the hash of its class to increase the number of bits in the key.


Levente

>
>
> Levente
>
>>
>> Thanks in advance,
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

carlo.t
In reply to this post by Mariano Martinez Peck
Hi

Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?

identityHash is deterministic in this case.

Does this help?
Cheers
Carlo

On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:

Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096  different values. 

Question:  having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:

set := PluggableSet new.
    set hashBlock: [ :elem | elem hash ].
    set equalBlock: [ :a :b | a == b ].

But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..

Anyway, my question is, should that work? if not, what is the exact reason?

Thanks in advance,

--
Mariano
http://marianopeck.wordpress.com


Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Mariano Martinez Peck


On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
Hi

Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?


Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash  when sending #hash to its elements.
Sorry, but I couldn't undertand what is the cause of the problem?  why it doesn't work while it does using #identityHash?  could you elaborate?

thanks
 
identityHash is deterministic in this case.

Does this help?
Cheers
Carlo

On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:

Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096  different values. 

Question:  having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:

set := PluggableSet new.
    set hashBlock: [ :elem | elem hash ].
    set equalBlock: [ :a :b | a == b ].

But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..

Anyway, my question is, should that work? if not, what is the exact reason?

Thanks in advance,

--
Mariano
http://marianopeck.wordpress.com





--
Mariano
http://marianopeck.wordpress.com

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Mariano Martinez Peck
In reply to this post by Levente Uzonyi-2
Thanks Levente. I will try both. Actually, I would also need a fast (or at least I would like to use #hash) IdentityDictionary apart from a IdentitySet ;)

On Mon, Dec 12, 2011 at 1:52 PM, Levente Uzonyi <[hidden email]> wrote:
On Mon, 12 Dec 2011, Levente Uzonyi wrote:

On Mon, 12 Dec 2011, Mariano Martinez Peck wrote:

Hi guys. I hope this is not a very stupid question. Background: in Fuel we
have a IdentityDictionary where we put each of the objects we find while
traversing the graph. We need to use IdentitySet because we cannot have
repetitions (and to avoid loops) so we NEED to use #==. In such dictionary
we put ALL objects of the graph, so it can be very big. Since IdentitySet
uses #identityHash, it means it will be using those ONLY 12 bits in the
object header. It means that we have 2^12 = 4096  different values.

Question:  having explained the previous, I wanted to be able to use #hash
rather than #identityHash since several classes implement #hash and hence I
thought that using #hash I could have less colisions and hence a better
performance. I tried to make a subclass of IdentitySet that uses #hash
rather than #identityHash but my image freezes. I also tried something like:

set := PluggableSet new.
  set hashBlock: [ :elem | elem hash ].
  set equalBlock: [ :a :b | a == b ].

But it doesn't work either. I works with simple tests in a workspace but
when I run the full tests of Fuel, it enters in a loop in the method #hash
of Collection..

Anyway, my question is, should that work? if not, what is the exact reason?

If the serialized objects remain unchanged, then it can work. You can also give another IdentitySet implementation a try, e.g.:
http://leves.web.elte.hu/squeak/LargeIdentitySet.st .

You can also take a look at SystemTracer, because (IIRC) it uses an IdentitySet which will contain all objects at a point, and it's reasonably fast. IIRC the idea is to use the identityHash of the object mixed with the hash of its class to increase the number of bits in the key.


Levente




Levente


Thanks in advance,

--
Mariano
http://marianopeck.wordpress.com







--
Mariano
http://marianopeck.wordpress.com

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Mariano Martinez Peck
In reply to this post by Mariano Martinez Peck


On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:


On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
Hi

Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?


Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash  when sending #hash to its elements.
Sorry, but I couldn't undertand what is the cause of the problem?  why it doesn't work while it does using #identityHash?  could you elaborate?


Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
 
thanks
 
identityHash is deterministic in this case.

Does this help?
Cheers
Carlo

On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:

Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096  different values. 

Question:  having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:

set := PluggableSet new.
    set hashBlock: [ :elem | elem hash ].
    set equalBlock: [ :a :b | a == b ].

But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..

Anyway, my question is, should that work? if not, what is the exact reason?

Thanks in advance,

--
Mariano
http://marianopeck.wordpress.com





--
Mariano
http://marianopeck.wordpress.com




--
Mariano
http://marianopeck.wordpress.com

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

carlo.t
Hi Mariano

I'm no expert either ;)
Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
Do you have anymore info or perhaps which methods you changed on IdentitySet?

Cheers
Carlo

On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:



On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:


On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
Hi

Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?


Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash  when sending #hash to its elements.
Sorry, but I couldn't undertand what is the cause of the problem?  why it doesn't work while it does using #identityHash?  could you elaborate?


Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
 
thanks
 
identityHash is deterministic in this case.

Does this help?
Cheers
Carlo

On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:

Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096  different values. 

Question:  having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:

set := PluggableSet new.
    set hashBlock: [ :elem | elem hash ].
    set equalBlock: [ :a :b | a == b ].

But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..

Anyway, my question is, should that work? if not, what is the exact reason?

Thanks in advance,

--
Mariano
http://marianopeck.wordpress.com





--
Mariano
http://marianopeck.wordpress.com




--
Mariano
http://marianopeck.wordpress.com


Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Michael Roberts-2
Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.

However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space?

IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do.

As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite....

Cheers
Mike


On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote:
> Hi Mariano
> I'm no expert either ;)
> Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
> Do you have anymore info or perhaps which methods you changed on IdentitySet?
> Cheers
> Carlo
> On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:
>
>
> On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
>>
>>
>> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
>>>
>>> Hi
>>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?
>>
>> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash  when sending #hash to its elements.
>> Sorry, but I couldn't undertand what is the cause of the problem?  why it doesn't work while it does using #identityHash?  could you elaborate?
>>
>
> Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
>  
>>
>> thanks
>>  
>>>
>>> identityHash is deterministic in this case.
>>> Does this help?
>>> Cheers
>>> Carlo
>>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:
>>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096  different values. 
>>>
>>> Question:  having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:
>>>
>>> set := PluggableSet new.
>>>     set hashBlock: [ :elem | elem hash ].
>>>     set equalBlock: [ :a :b | a == b ].
>>>
>>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..
>>>
>>> Anyway, my question is, should that work? if not, what is the exact reason?
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>>

>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Mariano Martinez Peck


On Tue, Dec 13, 2011 at 8:43 AM, Michael Roberts <[hidden email]> wrote:
Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.


No, I cannot use a Set because I cannot have repetitions. This that I am asking is what we do in Fuel serializer while traversing the object graph. Each analyzed object is put in an IdentityDictionary (but the question is the same for IdentitySet) as key. To avoid cycles, I need to put each object only once. Since graphs can be very big, such dict could have lots of objects.
 
However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space?

Yes, exactly. I were thinking if there could be a way (maybe...that's why I am asking) of improve its performance considering that I could have much more objects than 2^13. In other words, I wanted to see if I could avoid colisions.
 

IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do.

Ok...it is a tradeoff. If I use #identityHash it is fast because there is no delegation and it is almost an immediate primitive. But I gues it will be slow if there are lots of colisions. Not using #identityHash but something else maybe could decrease maybe the amount of colisions, but maybe with the delegation it will gets slower.

I will try with Levente idea of what it is done in SystemTracer: use as a hash the identityHash of the object mixed with the identityHash of its class. Maybe that decreases the colisions and at the same time I don't pay delegation (#class is special bytecode, so nothing, and ok..there are 2 sends to #identityhash but I don't thinnk it is that much).

Anyway, thanks for the interesting post, I always learn :)
 

As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite....


yes...
 
Cheers
Mike



On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote:
> Hi Mariano
> I'm no expert either ;)
> Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
> Do you have anymore info or perhaps which methods you changed on IdentitySet?
> Cheers
> Carlo
> On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:
>
>
> On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
>>
>>
>> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
>>>
>>> Hi
>>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?
>>
>> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash  when sending #hash to its elements.
>> Sorry, but I couldn't undertand what is the cause of the problem?  why it doesn't work while it does using #identityHash?  could you elaborate?
>>
>
> Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
>  
>>
>> thanks
>>  
>>>
>>> identityHash is deterministic in this case.
>>> Does this help?
>>> Cheers
>>> Carlo
>>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:
>>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096  different values. 
>>>
>>> Question:  having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:
>>>
>>> set := PluggableSet new.
>>>     set hashBlock: [ :elem | elem hash ].
>>>     set equalBlock: [ :a :b | a == b ].
>>>
>>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..
>>>
>>> Anyway, my question is, should that work? if not, what is the exact reason?
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>>
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>



--
Mariano
http://marianopeck.wordpress.com

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Michael Roberts-2
Ok I see. Have you measured the collisions on a very big graph? What stats do you get?

Cheers Mike

On 13 Dec 2011, at 08:37, Mariano Martinez Peck <[hidden email]> wrote:



On Tue, Dec 13, 2011 at 8:43 AM, Michael Roberts <[hidden email]> wrote:
Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.


No, I cannot use a Set because I cannot have repetitions. This that I am asking is what we do in Fuel serializer while traversing the object graph. Each analyzed object is put in an IdentityDictionary (but the question is the same for IdentitySet) as key. To avoid cycles, I need to put each object only once. Since graphs can be very big, such dict could have lots of objects.
 
However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space?

Yes, exactly. I were thinking if there could be a way (maybe...that's why I am asking) of improve its performance considering that I could have much more objects than 2^13. In other words, I wanted to see if I could avoid colisions.
 

IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do.

Ok...it is a tradeoff. If I use #identityHash it is fast because there is no delegation and it is almost an immediate primitive. But I gues it will be slow if there are lots of colisions. Not using #identityHash but something else maybe could decrease maybe the amount of colisions, but maybe with the delegation it will gets slower.

I will try with Levente idea of what it is done in SystemTracer: use as a hash the identityHash of the object mixed with the identityHash of its class. Maybe that decreases the colisions and at the same time I don't pay delegation (#class is special bytecode, so nothing, and ok..there are 2 sends to #identityhash but I don't thinnk it is that much).

Anyway, thanks for the interesting post, I always learn :)
 

As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite....


yes...
 
Cheers
Mike



On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote:
> Hi Mariano
> I'm no expert either ;)
> Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
> Do you have anymore info or perhaps which methods you changed on IdentitySet?
> Cheers
> Carlo
> On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:
>
>
> On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
>>
>>
>> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
>>>
>>> Hi
>>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?
>>
>> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash  when sending #hash to its elements.
>> Sorry, but I couldn't undertand what is the cause of the problem?  why it doesn't work while it does using #identityHash?  could you elaborate?
>>
>
> Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
>  
>>
>> thanks
>>  
>>>
>>> identityHash is deterministic in this case.
>>> Does this help?
>>> Cheers
>>> Carlo
>>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:
>>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096  different values. 
>>>
>>> Question:  having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:
>>>
>>> set := PluggableSet new.
>>>     set hashBlock: [ :elem | elem hash ].
>>>     set equalBlock: [ :a :b | a == b ].
>>>
>>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..
>>>
>>> Anyway, my question is, should that work? if not, what is the exact reason?
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>>
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>



--
Mariano
http://marianopeck.wordpress.com

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Eliot Miranda-2
In reply to this post by Mariano Martinez Peck
Hi Mariano,

On Tue, Dec 13, 2011 at 12:37 AM, Mariano Martinez Peck <[hidden email]> wrote:


On Tue, Dec 13, 2011 at 8:43 AM, Michael Roberts <[hidden email]> wrote:
Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.


No, I cannot use a Set because I cannot have repetitions. This that I am asking is what we do in Fuel serializer while traversing the object graph. Each analyzed object is put in an IdentityDictionary (but the question is the same for IdentitySet) as key. To avoid cycles, I need to put each object only once. Since graphs can be very big, such dict could have lots of objects.
 
However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space?

Yes, exactly. I were thinking if there could be a way (maybe...that's why I am asking) of improve its performance considering that I could have much more objects than 2^13. In other words, I wanted to see if I could avoid colisions.

You can assume that certain properties of objects will not change during serialization, for example the class of objects, the basic size of objects.  So you can construct a valid extended identity hash from these properties.  For example

fuelSerializationHash
    ^self identityHash + (self class identityHash bitShift: 12) + (self basicSize bitShift: 24)

In general this idea may not work because of meta-primitives like changeClassTo:, which would change the hash.  But in Fuel's case I think it's safe to assume that objects won't change class or size during serialization.

HTH

 
 

IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do.

Ok...it is a tradeoff. If I use #identityHash it is fast because there is no delegation and it is almost an immediate primitive. But I gues it will be slow if there are lots of colisions. Not using #identityHash but something else maybe could decrease maybe the amount of colisions, but maybe with the delegation it will gets slower.

I will try with Levente idea of what it is done in SystemTracer: use as a hash the identityHash of the object mixed with the identityHash of its class. Maybe that decreases the colisions and at the same time I don't pay delegation (#class is special bytecode, so nothing, and ok..there are 2 sends to #identityhash but I don't thinnk it is that much).

Anyway, thanks for the interesting post, I always learn :)
 

As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite....


yes...
 
Cheers
Mike



On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote:
> Hi Mariano
> I'm no expert either ;)
> Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
> Do you have anymore info or perhaps which methods you changed on IdentitySet?
> Cheers
> Carlo
> On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:
>
>
> On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
>>
>>
>> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
>>>
>>> Hi
>>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?
>>
>> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash  when sending #hash to its elements.
>> Sorry, but I couldn't undertand what is the cause of the problem?  why it doesn't work while it does using #identityHash?  could you elaborate?
>>
>
> Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
>  
>>
>> thanks
>>  
>>>
>>> identityHash is deterministic in this case.
>>> Does this help?
>>> Cheers
>>> Carlo
>>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:
>>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096  different values. 
>>>
>>> Question:  having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:
>>>
>>> set := PluggableSet new.
>>>     set hashBlock: [ :elem | elem hash ].
>>>     set equalBlock: [ :a :b | a == b ].
>>>
>>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..
>>>
>>> Anyway, my question is, should that work? if not, what is the exact reason?
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>>
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>



--
Mariano
http://marianopeck.wordpress.com




--
best,
Eliot

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Mariano Martinez Peck


On Tue, Dec 13, 2011 at 8:44 PM, Eliot Miranda <[hidden email]> wrote:
Hi Mariano,

On Tue, Dec 13, 2011 at 12:37 AM, Mariano Martinez Peck <[hidden email]> wrote:


On Tue, Dec 13, 2011 at 8:43 AM, Michael Roberts <[hidden email]> wrote:
Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.


No, I cannot use a Set because I cannot have repetitions. This that I am asking is what we do in Fuel serializer while traversing the object graph. Each analyzed object is put in an IdentityDictionary (but the question is the same for IdentitySet) as key. To avoid cycles, I need to put each object only once. Since graphs can be very big, such dict could have lots of objects.
 
However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space?

Yes, exactly. I were thinking if there could be a way (maybe...that's why I am asking) of improve its performance considering that I could have much more objects than 2^13. In other words, I wanted to see if I could avoid colisions.

You can assume that certain properties of objects will not change during serialization, for example the class of objects, the basic size of objects.  So you can construct a valid extended identity hash from these properties.  For example

fuelSerializationHash
    ^self identityHash + (self class identityHash bitShift: 12) + (self basicSize bitShift: 24)


Thanks!!! that's exactly what I wanted to try :)
 
In general this idea may not work because of meta-primitives like changeClassTo:, which would change the hash.  But in Fuel's case I think it's safe to assume that objects won't change class or size during serialization.

Exactly :)  Moreoever, if objects change their state or shape, or, in other words, in the graph changes while we are in the middle of the serialization, we will be screw anyway....whether the hash has changed or not :)

I will see what is worst, if the colisions or having to send 2 times #identityHash and 1 #basicSize.

Thanks!
 

HTH

 
 

IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do.

Ok...it is a tradeoff. If I use #identityHash it is fast because there is no delegation and it is almost an immediate primitive. But I gues it will be slow if there are lots of colisions. Not using #identityHash but something else maybe could decrease maybe the amount of colisions, but maybe with the delegation it will gets slower.

I will try with Levente idea of what it is done in SystemTracer: use as a hash the identityHash of the object mixed with the identityHash of its class. Maybe that decreases the colisions and at the same time I don't pay delegation (#class is special bytecode, so nothing, and ok..there are 2 sends to #identityhash but I don't thinnk it is that much).

Anyway, thanks for the interesting post, I always learn :)
 

As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite....


yes...
 
Cheers
Mike



On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote:
> Hi Mariano
> I'm no expert either ;)
> Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
> Do you have anymore info or perhaps which methods you changed on IdentitySet?
> Cheers
> Carlo
> On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:
>
>
> On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
>>
>>
>> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
>>>
>>> Hi
>>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?
>>
>> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash  when sending #hash to its elements.
>> Sorry, but I couldn't undertand what is the cause of the problem?  why it doesn't work while it does using #identityHash?  could you elaborate?
>>
>
> Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
>  
>>
>> thanks
>>  
>>>
>>> identityHash is deterministic in this case.
>>> Does this help?
>>> Cheers
>>> Carlo
>>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:
>>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096  different values. 
>>>
>>> Question:  having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:
>>>
>>> set := PluggableSet new.
>>>     set hashBlock: [ :elem | elem hash ].
>>>     set equalBlock: [ :a :b | a == b ].
>>>
>>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..
>>>
>>> Anyway, my question is, should that work? if not, what is the exact reason?
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>>
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>



--
Mariano
http://marianopeck.wordpress.com




--
best,
Eliot




--
Mariano
http://marianopeck.wordpress.com

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Mariano Martinez Peck
Well...it is much slower :(  it seems that the cost of  (aKey identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey basicSize bitShift: 24)
is bigger than the colisions.
Anyway, thanks for the nice thread. I learned.

Cheers

On Tue, Dec 13, 2011 at 8:51 PM, Mariano Martinez Peck <[hidden email]> wrote:


On Tue, Dec 13, 2011 at 8:44 PM, Eliot Miranda <[hidden email]> wrote:
Hi Mariano,

On Tue, Dec 13, 2011 at 12:37 AM, Mariano Martinez Peck <[hidden email]> wrote:


On Tue, Dec 13, 2011 at 8:43 AM, Michael Roberts <[hidden email]> wrote:
Hi Mariano, when I read this thread I was a bit confused that you wanted an IdentitySet that used #hash. From that statement it sounded like you just wanted a Set. This would allow any object to define its own hash and importantly what equality means with #=. So if you want to delegate that to the object use a Set.


No, I cannot use a Set because I cannot have repetitions. This that I am asking is what we do in Fuel serializer while traversing the object graph. Each analyzed object is put in an IdentityDictionary (but the question is the same for IdentitySet) as key. To avoid cycles, I need to put each object only once. Since graphs can be very big, such dict could have lots of objects.
 
However as the thread has gone on perhaps you want the identity relationship but you just wanted a bigger identity hash space?

Yes, exactly. I were thinking if there could be a way (maybe...that's why I am asking) of improve its performance considering that I could have much more objects than 2^13. In other words, I wanted to see if I could avoid colisions.

You can assume that certain properties of objects will not change during serialization, for example the class of objects, the basic size of objects.  So you can construct a valid extended identity hash from these properties.  For example

fuelSerializationHash
    ^self identityHash + (self class identityHash bitShift: 12) + (self basicSize bitShift: 24)


Thanks!!! that's exactly what I wanted to try :)
 
In general this idea may not work because of meta-primitives like changeClassTo:, which would change the hash.  But in Fuel's case I think it's safe to assume that objects won't change class or size during serialization.

Exactly :)  Moreoever, if objects change their state or shape, or, in other words, in the graph changes while we are in the middle of the serialization, we will be screw anyway....whether the hash has changed or not :)

I will see what is worst, if the colisions or having to send 2 times #identityHash and 1 #basicSize.

Thanks!
 

HTH

 
 

IdentitySets are fast because they bypass any delegation. Once you have seen the object in your traversal (common usage pattern) that's it. You grab it's identity which is a pretty low level thing to do.

Ok...it is a tradeoff. If I use #identityHash it is fast because there is no delegation and it is almost an immediate primitive. But I gues it will be slow if there are lots of colisions. Not using #identityHash but something else maybe could decrease maybe the amount of colisions, but maybe with the delegation it will gets slower.

I will try with Levente idea of what it is done in SystemTracer: use as a hash the identityHash of the object mixed with the identityHash of its class. Maybe that decreases the colisions and at the same time I don't pay delegation (#class is special bytecode, so nothing, and ok..there are 2 sends to #identityhash but I don't thinnk it is that much).

Anyway, thanks for the interesting post, I always learn :)
 

As for what happens with collections who knows. Depends. Relying on identity set semantics for a collection is easy. Set semantics is not so. Remember that both hash and equals are important to know if the set already contains the element. Depending on the collection implementation both of these could be composite in terms of the parts. Who knows where you end or how long it takes. I.e. if it is a function of the size of the collection and further collections are composite....


yes...
 
Cheers
Mike



On Tuesday, December 13, 2011, Carlo <[hidden email]> wrote:
> Hi Mariano
> I'm no expert either ;)
> Without having access to the exact code it would look like either you have a collection that references itself (which would break all collection implementations) or maybe the tests have just slowed down to the point where you think it's 'crashed'.
> Do you have anymore info or perhaps which methods you changed on IdentitySet?
> Cheers
> Carlo
> On 13 Dec 2011, at 1:57 AM, Mariano Martinez Peck wrote:
>
>
> On Tue, Dec 13, 2011 at 12:32 AM, Mariano Martinez Peck <[hidden email]> wrote:
>>
>>
>> On Mon, Dec 12, 2011 at 1:56 PM, Carlo <[hidden email]> wrote:
>>>
>>> Hi
>>> Wouldn't the fact that you use hash cause potential loops now? e.g. collection refers to another object that refers to first collection. --> aCollection>>hash references an item which causes this current collection's hash to be called again?
>>
>> Hi Carlo. I am still newbie with Collections but I think I am having exactly that problem. During my tests, it loops in Collection >> #hash  when sending #hash to its elements.
>> Sorry, but I couldn't undertand what is the cause of the problem?  why it doesn't work while it does using #identityHash?  could you elaborate?
>>
>
> Well, now I understood, and I understand also why it doesn't happen with #identityHash. But what happens then with regular Dictionaries using #hash? why it doesn't happen there?
>  
>>
>> thanks
>>  
>>>
>>> identityHash is deterministic in this case.
>>> Does this help?
>>> Cheers
>>> Carlo
>>> On 12 Dec 2011, at 10:58 AM, Mariano Martinez Peck wrote:
>>> Hi guys. I hope this is not a very stupid question. Background: in Fuel we have a IdentityDictionary where we put each of the objects we find while traversing the graph. We need to use IdentitySet because we cannot have repetitions (and to avoid loops) so we NEED to use #==. In such dictionary we put ALL objects of the graph, so it can be very big. Since IdentitySet uses #identityHash, it means it will be using those ONLY 12 bits in the object header. It means that we have 2^12 = 4096  different values. 
>>>
>>> Question:  having explained the previous, I wanted to be able to use #hash rather than #identityHash since several classes implement #hash and hence I thought that using #hash I could have less colisions and hence a better performance. I tried to make a subclass of IdentitySet that uses #hash rather than #identityHash but my image freezes. I also tried something like:
>>>
>>> set := PluggableSet new.
>>>     set hashBlock: [ :elem | elem hash ].
>>>     set equalBlock: [ :a :b | a == b ].
>>>
>>> But it doesn't work either. I works with simple tests in a workspace but when I run the full tests of Fuel, it enters in a loop in the method #hash of Collection..
>>>
>>> Anyway, my question is, should that work? if not, what is the exact reason?
>>>
>>> Thanks in advance,
>>>
>>> --
>>> Mariano
>>> http://marianopeck.wordpress.com
>>>
>>>
>>
>>
>>
>> --
>> Mariano
>> http://marianopeck.wordpress.com
>>
>
>
>
> --
> Mariano
> http://marianopeck.wordpress.com
>
>
>



--
Mariano
http://marianopeck.wordpress.com




--
best,
Eliot




--
Mariano
http://marianopeck.wordpress.com




--
Mariano
http://marianopeck.wordpress.com

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Guido Stepken

Collision free hashing based on fibonacci pseudo primes? 25 years old ... :-)

On Dec 15, 2011 11:35 PM, "Mariano Martinez Peck" <[hidden email]> wrote:

Well...it is much slower :(  it seems that the cost of  (aKey identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey basicSize bitShift: 24)
is bigger than the colisions.
Anyway, thanks for the nice thread. I learned.

Cheers



On Tue, Dec 13, 2011 at 8:51 PM, Mariano Martinez Peck <[hidden email]> wrote:
>
>
>
> On T...

--
Mariano
http://marianopeck.wordpress.com

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Henrik Sperre Johansen
In reply to this post by Mariano Martinez Peck
On 15.12.2011 23:35, Mariano Martinez Peck wrote:
> Well...it is much slower :(  it seems that the cost of  (aKey
> identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey
> basicSize bitShift: 24)
> is bigger than the colisions.
> Anyway, thanks for the nice thread. I learned.
>
> Cheers
>
Well, first of all, that always creates a largeInteger, since
identityHash is already shifted by 22... You'd want to be using
basicIdentityHash.
2nd off, basicSize is useless for non-variable classes, which I guess is
the common case.
3rd, putting class hash in the high bits won't really help much if
you're serializing many instances of the same class, as they will all
occupy a sequential hash range.

So if I were you, I'd try with something like
obj basicIdentityHash << 12 + (obj class basicIdentityHash)
before discarding it outright due to computation cost.

Cheers,
Henry

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Henrik Sperre Johansen
In reply to this post by Mariano Martinez Peck
On 16.12.2011 00:40, Henrik Sperre Johansen wrote:

> On 15.12.2011 23:35, Mariano Martinez Peck wrote:
>> Well...it is much slower :(  it seems that the cost of  (aKey
>> identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey
>> basicSize bitShift: 24)
>> is bigger than the colisions.
>> Anyway, thanks for the nice thread. I learned.
>>
>> Cheers
>>
> Well, first of all, that always creates a largeInteger, since
> identityHash is already shifted by 22... You'd want to be using
> basicIdentityHash.
> 2nd off, basicSize is useless for non-variable classes, which I guess
> is the common case.
> 3rd, putting class hash in the high bits won't really help much if
> you're serializing many instances of the same class, as they will all
> occupy a sequential hash range.
>
> So if I were you, I'd try with something like
> obj basicIdentityHash << 12 + (obj class basicIdentityHash)
> before discarding it outright due to computation cost.
>
> Cheers,
> Henry
>
Note: This is not portable to Squeak.
They kept the old version of identityHash (renamed to basicIdentityHash
in Pharo), and introduced scaledIdentityHash (doing the same as
identityHash in Pharo), due to backwards compatability concerns for
users explicitly relying on #identityHash returning a value in range [0,
4095].

Cheers,
Henry

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Henrik Sperre Johansen
In reply to this post by Mariano Martinez Peck
On 16.12.2011 00:40, Henrik Sperre Johansen wrote:

> On 15.12.2011 23:35, Mariano Martinez Peck wrote:
>> Well...it is much slower :(  it seems that the cost of  (aKey
>> identityHash + ( aKey mareaClass identityHash bitShift: 12) + (aKey
>> basicSize bitShift: 24)
>> is bigger than the colisions.
>> Anyway, thanks for the nice thread. I learned.
>>
>> Cheers
>>
> Well, first of all, that always creates a largeInteger, since
> identityHash is already shifted by 22... You'd want to be using
> basicIdentityHash.
> 2nd off, basicSize is useless for non-variable classes, which I guess
> is the common case.
> 3rd, putting class hash in the high bits won't really help much if
> you're serializing many instances of the same class, as they will all
> occupy a sequential hash range.
>
> So if I were you, I'd try with something like
> obj basicIdentityHash << 12 + (obj class basicIdentityHash)
> before discarding it outright due to computation cost.
>
> Cheers,
> Henry
>
A few more points:
1) don't use PluggableSets. Block activations are costly, and you REALLY
need specialized data/hash functions for it to be worth using.
2) Even with a subclass of IdentitySet, it's hardly worth it:

PluggableSet:
n := 1000000.
set := PluggableSet new: n.
set hashBlock: [:obj | obj basicIdentityHash << 12 + (obj class
basicIdentityHash) ].
set equalBlock: [ :a :b | a == b ].
Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 18828

The one I suggested in a custom subclass:
n := 1000000.
set := PseudoIdentitySet new: n.
Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]]  
14182

Usual identitySet:
n := 1000000.
set := IdentitySet new: n.
Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 4595

Custom subclass with a slightly modified hash (anObject identityHash +
(anObject class basicIdentityHash bitShift: 6)):
n := 1000000.
set := PseudoIdentitySet2 new: n.
Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 3694
Why 6 you ask? Because I'm worth it.

Not really enough to justify the extra compexity, imho.

Cheers,
Henry

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Henrik Sperre Johansen
In reply to this post by Mariano Martinez Peck
On 16.12.2011 01:06, Henrik Sperre Johansen wrote:

>
> Usual identitySet:
> n := 1000000.
> set := IdentitySet new: n.
> Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]]
> 4595
>
> Custom subclass with a slightly modified hash (anObject identityHash +
> (anObject class basicIdentityHash bitShift: 6)):
> n := 1000000.
> set := PseudoIdentitySet2 new: n.
> Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]]
> 3694
> Why 6 you ask? Because I'm worth it.
>
> Not really enough to justify the extra compexity, imho.
>
> Cheers,
> Henry
>
To put that a bit more in perspective, the extra cost for not
preallocating to a sane value:
n := 1000000.
set := IdentitySet new.
Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]]  6872

Cheers,
Henry

Reply | Threaded
Open this post in threaded view
|

Re: IdentitySet but using #hash rather than #identityHash ?

Levente Uzonyi-2
In reply to this post by Henrik Sperre Johansen
On Fri, 16 Dec 2011, Henrik Sperre Johansen wrote:

> On 16.12.2011 00:40, Henrik Sperre Johansen wrote:
>> On 15.12.2011 23:35, Mariano Martinez Peck wrote:
>>> Well...it is much slower :(  it seems that the cost of  (aKey identityHash
>>> + ( aKey mareaClass identityHash bitShift: 12) + (aKey basicSize bitShift:
>>> 24)
>>> is bigger than the colisions.
>>> Anyway, thanks for the nice thread. I learned.
>>>
>>> Cheers
>>>
>> Well, first of all, that always creates a largeInteger, since identityHash
>> is already shifted by 22... You'd want to be using basicIdentityHash.
>> 2nd off, basicSize is useless for non-variable classes, which I guess is
>> the common case.
>> 3rd, putting class hash in the high bits won't really help much if you're
>> serializing many instances of the same class, as they will all occupy a
>> sequential hash range.
>>
>> So if I were you, I'd try with something like
>> obj basicIdentityHash << 12 + (obj class basicIdentityHash)
>> before discarding it outright due to computation cost.
>>
>> Cheers,
>> Henry
>>
> A few more points:
> 1) don't use PluggableSets. Block activations are costly, and you REALLY need
> specialized data/hash functions for it to be worth using.
> 2) Even with a subclass of IdentitySet, it's hardly worth it:
>
> PluggableSet:
> n := 1000000.
> set := PluggableSet new: n.
> set hashBlock: [:obj | obj basicIdentityHash << 12 + (obj class
> basicIdentityHash) ].
> set equalBlock: [ :a :b | a == b ].
> Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 18828
>
> The one I suggested in a custom subclass:
> n := 1000000.
> set := PseudoIdentitySet new: n.
> Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]]  14182
>
> Usual identitySet:
> n := 1000000.
> set := IdentitySet new: n.
> Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 4595
>
> Custom subclass with a slightly modified hash (anObject identityHash +
> (anObject class basicIdentityHash bitShift: 6)):
> n := 1000000.
> set := PseudoIdentitySet2 new: n.
> Time millisecondsToRun: [1 to: n do: [:waste | set add: Object new.]] 3694
> Why 6 you ask? Because I'm worth it.
>
> Not really enough to justify the extra compexity, imho.

How about my numbers? :)

"Preallocate objects, so we won't count gc time."
n := 1000000.
objects := Array new: n streamContents: [ :stream |
  n timesRepeat: [ stream nextPut: Object new ] ].

set := IdentitySet new: n.
Smalltalk garbageCollect.
[1 to: n do: [ :i | set add: (objects at: i) ] ] timeToRun. "4949"

set := LargeIdentitySet new.
Smalltalk garbageCollect.
[1 to: n do: [ :i | set add: (objects at: i) ] ] timeToRun. "331"

set := (PluggableSet new: n)
  hashBlock: [ :object | object identityHash * 4096 + object class identityHash * 64 ]; "Change this to #basicIdentityHash in Pharo"
  equalBlock: [ :a :b | a == b ];
  yourself.
Smalltalk garbageCollect.
[1 to: n do: [ :i | set add: (objects at: i) ] ] timeToRun. "5511"


I also have a LargeIdentityDictionary, which is relatively fast, but not
as fast as LargeIdentitySet, because (for some unknown reason) we don't
have a primitive that could support it. If we had a primitive like
primitive 132 which would return the index of the element if found or 0
if not, then we could have a really fast LargeIdentityDictionary.


Levente

>
> Cheers,
> Henry
>
>

1234