Unique all characters?

Eliot Miranda-2
Hi All,

    having just stumbled across the fact that only characters with codes from 0 to 255 are unique, I wondered whether anyone has considered doing the following:

Character addClassVarNamed: 'LargeCodeCharacters'.


Character class methods for class initialization
initialize
    LargeCodeCharacters := WeakSet new


Character class methods for instance creation
value: anInteger
    "Answer the Character whose value is anInteger."

    | theCharacter existingInstanceOrNil |
    anInteger <= 255 ifTrue:
        [^CharacterTable at: anInteger + 1].
    theCharacter := self basicNew setValue: anInteger.
    ^(existingInstanceOrNil := LargeCodeCharacters like: theCharacter)
        ifNil: [LargeCodeCharacters add: theCharacter]
        ifNotNil: [existingInstanceOrNil]


Yes, this has the potential to create a lot of space overhead, but only for artificial code that enumerates over all characters.  I suspect that in most cases the set of characters actually in use would be quite small.

(Alternatives that are indexed by integers might also work well, e.g. a flat WeakValueDictionary that used a WeakArray for its values).

Just a thought...
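
The integer-indexed alternative mentioned above can be modelled outside Smalltalk. The following is a minimal Python sketch, not Squeak code: the class, the `_table` and `_large` names, and the use of Python's `weakref.WeakValueDictionary` are all assumptions made for illustration. It looks up the code first and only creates an instance on a miss, so no throwaway instance is needed.

```python
import gc
import weakref


class Character:
    """Toy stand-in for Smalltalk's Character; all names are hypothetical."""

    # Strongly held instances for codes 0..255, like CharacterTable in the post.
    _table = []
    # Weakly held instances for codes above 255, keyed by integer code.
    _large = weakref.WeakValueDictionary()

    def __init__(self, code):
        self.code = code

    @classmethod
    def value(cls, code):
        """Answer the unique Character whose value is `code`."""
        if code <= 255:
            return cls._table[code]
        existing = cls._large.get(code)
        if existing is None:
            # Miss: create the instance and register it weakly.
            existing = cls(code)
            cls._large[code] = existing
        return existing


Character._table = [Character(c) for c in range(256)]

euro = Character.value(0x20AC)
print(euro is Character.value(0x20AC))  # identity holds while `euro` is alive
del euro
gc.collect()
print(len(Character._large))  # weak entries go away once unreferenced
```

The weak table only grows with the characters that are actually referenced somewhere, which is the space argument made in the post. It says nothing about the lookup or finalization cost in a real image.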


Re: Unique all characters?

Philippe Marschall
2008/1/28, Eliot Miranda <[hidden email]>:

> Hi All,
>
>     having just stumbled across the fact that only characters with codes
> from 0 to 255 are unique I wondered whether anyone has considered doing the
> following:
>
> Character addClassVarNamed: 'LargeCodeCharacters'.
>
>
> Character class methods for class initialization
> initialize
>     LargeCodeCharacters := WeakSet new
>
>
> Character class methods for instance creation
> value: anInteger
>      "Answer the Character whose value is anInteger."
>
>      | theCharacter existingInstanceOrNil |
>      anInteger <= 255 ifTrue:
>          [^CharacterTable at: anInteger + 1].
>      theCharacter := self basicNew setValue: anInteger.
>      ^(existingInstanceOrNil := LargeCodeCharacters like: theCharacter)
>          ifNil: [LargeCodeCharacters add: theCharacter]
>          ifNotNil: [existingInstanceOrNil]
>
>
> Yes this has the potential to create a lot of space overhead, but only for
> artificial codes that enumerate over all characters.  I suspect that for
> most cases the actual set of active characters would be quite small.
>
> (Alternatives that are indexed by integers might also work well, e.g. a flat
> WeakValueDictionary that used a WeakArray for its values).

Lord no, please no more Weak* collections. That was one of the major
performance fixes we did in Seaside, kicking Weak* collections. They
don't scale, they kill you in production.

Cheers
Philippe

Re: Unique all characters?

Colin Putney
In reply to this post by Eliot Miranda-2

On 28-Jan-08, at 10:23 AM, Eliot Miranda wrote:

>     having just stumbled across the fact that only characters with  
> codes from 0 to 255 are unique I wondered whether anyone has  
> considered doing the following:

What would be the motivation?

Colin

Re: Unique all characters?

Andreas.Raab
In reply to this post by Philippe Marschall
Philippe Marschall wrote:
>> (Alternatives that are indexed by integers might also work well, e.g. a flat
>> WeakValueDictionary that used a WeakArray for its values).
>
> Lord no, please no more Weak* collections. That was one of the major
> performance fixes we did in Seaside, kicking Weak* collections. They
> don't scale, they kill you in production.

It's finalization that kills you, not weak collections per se. If weak
collections were the problem, symbol management would cause the same
issues. For the case in question you could shrink the character table on
system startup / shutdown, which would avoid the finalization issues.

Cheers,
   - Andreas

Re: Unique all characters?

Bert Freudenberg
On Jan 28, 2008, at 21:51 , Andreas Raab wrote:

> Philippe Marschall wrote:
>>> (Alternatives that are indexed by integers might also work well,  
>>> e.g. a flat
>>> WeakValueDictionary that used a WeakArray for its values).
>> Lord no, please no more Weak* collections. That was one of the major
>> performance fixes we did in Seaside, kicking Weak* collections. They
>> don't scale, they kill you in production.
>
> It's finalization that kills you, not weak collections per se. If  
> it were then symbol management should cause the same issues. For  
> the case in question you could shrink the character table on system  
> startup / shutdown which would avoid the finalization issues.

Well, instances of characters are mostly temporary - Strings actually  
store binary numbers, not Character instances, they create Characters  
on the fly.

One would have to measure the space and performance trade-offs of  
looking up unique Character instances vs. simply creating characters  
when needed. My hunch is that it doesn't matter, so it is not worth the
added complexity.

- Bert -
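
To make the point about strings concrete, here is a small Python model, offered as a sketch under stated assumptions rather than a description of Squeak's String classes. `WideString`, `Character`, and `at` are hypothetical names. The string keeps only numeric codes, and a character object is boxed each time an element is read. Two reads of the same position therefore give equal but distinct objects, which is the situation for codes above 255 that started this thread.

```python
from array import array


class Character:
    """Hypothetical boxed character: one heap object per instance."""

    def __init__(self, code):
        self.code = code

    def __eq__(self, other):
        return isinstance(other, Character) and self.code == other.code

    def __hash__(self):
        return hash(self.code)


class WideString:
    """Toy string that stores raw code points, not Character objects."""

    def __init__(self, text):
        # 'L' is an unsigned integer array: compact numeric storage.
        self._codes = array("L", (ord(ch) for ch in text))

    def at(self, index):
        # Smalltalk-style 1-based access; a Character is created on demand.
        return Character(self._codes[index - 1])


s = WideString("a\u20ac")
first = s.at(2)
second = s.at(2)
print(first == second)  # equality compares codes
print(first is second)  # identity compares objects
```

Whether interning those short-lived objects pays off is exactly the measurement asked for above; the sketch only shows where the instances come from.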



Re: Unique all characters?

Nicolas Cellier-3
Making Character an immediate like SmallInteger, instead of an ordinary
object referenced by OOP, would probably make a difference.

Both unique and fast, at the price of added complexity in the VM.

But Eliot must know this for sure.

Nicolas

Bert Freudenberg wrote:

> On Jan 28, 2008, at 21:51 , Andreas Raab wrote:
>
>> Philippe Marschall wrote:
>>>> (Alternatives that are indexed by integers might also work well,
>>>> e.g. a flat
>>>> WeakValueDictionary that used a WeakArray for its values).
>>> Lord no, please no more Weak* collections. That was one of the major
>>> performance fixes we did in Seaside, kicking Weak* collections. They
>>> don't scale, they kill you in production.
>>
>> It's finalization that kills you, not weak collections per se. If it
>> were then symbol management should cause the same issues. For the case
>> in question you could shrink the character table on system startup /
>> shutdown which would avoid the finalization issues.
>
> Well, instances of characters are mostly temporary - Strings actually
> store binary numbers, not Character instances, they create Characters on
> the fly.
>
> One would have to measure the space and performance trade-offs of
> looking up unique Character instances vs. simply creating characters
> when needed. My hunch is it doesn't matter so is not worth the added
> complexity.
>
> - Bert -
>
>
>
>


Re: Unique all characters?

Yoshiki Ohshima-2
In reply to this post by Eliot Miranda-2
  Eliot,

> (Alternatives that are indexed by integers might also work well, e.g. a flat WeakValueDictionary that used a WeakArray
> for its values).

  Yes, there would be no reason to do basicNew.

  But I need to ask the same question Colin asked: what would we gain?
Most of the time, characters are in strings, so there are not many
real instances around.  For writing a parser (hmm) it might make
things a bit faster, but not much.

  Tagged immediate character objects would have been ok (like
VisualWorks), but it puts some pressure on integers...

-- Yoshiki

Re: Unique all characters?

Nicolas Cellier-3
Yoshiki Ohshima wrote:

>   Eliot,
>
>> (Alternatives that are indexed by integers might also work well, e.g. a flat WeakValueDictionary that used a WeakArray
>> for its values).
>
>   Yes, there would be no reason to do basicNew.
>
>   But I need to ask the same question Colin asked; what would we gain?
> For most of the time, characters are in strings so there are not many
> real instances around.  For writing a parser (hmm) it might make
> things a bit faster, but not much.
>
>   Tagged immediate character objects would have been ok (like
> VisualWorks), but it puts some pressure on integers...
>

Sure, 1 bit less.
This will have to be reconsidered once 64-bit Squeak becomes widespread.

Nicolas

> -- Yoshiki
>
>


Re: Unique all characters?

Nicolas Cellier-3
nicolas cellier wrote:

> Yoshiki Ohshima wrote:
>>   Eliot,
>>
>>> (Alternatives that are indexed by integers might also work well, e.g.
>>> a flat WeakValueDictionary that used a WeakArray
>>> for its values).
>>
>>   Yes, there would be no reason to do basicNew.
>>
>>   But I need to ask the same question Colin asked; what would we gain?
>> For most of the time, characters are in strings so there are not many
>> real instances around.  For writing a parser (hmm) it might make
>> things a bit faster, but not much.
>>
>>   Tagged immediate character objects would have been ok (like
>> VisualWorks), but it puts some pressure on integers...
>>
>
> Sure, 1 bit less.
> This has to be thought again when 64 bits Squeak will spread.
>
> Nicolas
>

But we do not need to reserve two tag bits for every case:

xxx..xxx1 is a SmallInteger on 31 bits
xxx..xx10 is a Character
xxx..xx00 is an OOP

This does not put pressure on integers.

Nicolas
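
The three-way tagging above can be checked with a little bit arithmetic. This is a Python sketch of the scheme as written in this message, assuming 32-bit words; the function names are invented for illustration and do not correspond to any VM source. Integers keep their 31 bits because only words ending in 0 spend a second tag bit, which leaves 30 bits of payload for a Character.

```python
SMALL_INTEGER, CHARACTER, OOP = "SmallInteger", "Character", "OOP"


def classify(word):
    """Classify a 32-bit word by its low tag bits."""
    if (word & 0b1) == 0b1:       # xxx..xxx1
        return SMALL_INTEGER
    if (word & 0b11) == 0b10:     # xxx..xx10
        return CHARACTER
    return OOP                    # xxx..xx00


def encode_small_integer(value):
    """Tag a signed 31-bit integer: shift left one bit, set the low bit."""
    if not -(1 << 30) <= value < (1 << 30):
        raise OverflowError("does not fit in 31 bits")
    return ((value << 1) | 1) & 0xFFFFFFFF


def decode_small_integer(word):
    """Recover the signed 31-bit value from a tagged 32-bit word."""
    value = word >> 1
    # Bit 30 of the shifted value is the sign bit of the 31-bit integer.
    return value - (1 << 31) if value & (1 << 30) else value


def encode_character(code):
    """Tag a code point: shift left two bits, set the tag to binary 10."""
    if not 0 <= code < (1 << 30):
        raise OverflowError("does not fit in 30 bits")
    return (code << 2) | 0b10


def decode_character(word):
    """Recover the code point from a tagged word."""
    return word >> 2
```

For example, `classify(encode_character(0x20AC))` answers `"Character"`, and any 4-byte-aligned address such as `0x1000` is classified as an OOP.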


Re: Unique all characters?

Andreas.Raab
nicolas cellier wrote:
> But we do not need to reserve two tag bits for every case:
>
> xxx..xxx1 is a SmallInteger on 31 bits
> xxx..xx10 is a Character
> xxx..xx00 is an OOP
>
> This does not put pressure on integer

You may want to read this post:

http://lists.squeakfoundation.org/pipermail/vm-dev/2006-January/000429.html

It outlines a similar approach, except that it adds 64 immediate classes
instead of one and in return reduces the number of available bits to 24
(which makes for nice 1x24, 2x12, 3x8, 4x6 usage patterns in characters,
immediate points, short colors, etc.).

Cheers,
   - Andreas
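
The 1x24, 2x12, 3x8 and 4x6 patterns are simply the ways of cutting a 24-bit payload into equal fields. The Python sketch below illustrates only that arithmetic. `pack` and `unpack` are hypothetical helpers, and how the linked post arranges the class index and tag bits around the payload is not modelled here. As a plausibility check under the assumption of 32-bit words, 64 classes need 6 bits, and 6 + 24 + 2 tag bits gives 32.

```python
PAYLOAD_BITS = 24


def pack(fields, width):
    """Pack equal-width unsigned fields into one 24-bit payload."""
    if len(fields) * width != PAYLOAD_BITS:
        raise ValueError("fields must fill exactly 24 bits")
    payload = 0
    for field in fields:
        if not 0 <= field < (1 << width):
            raise ValueError("field out of range")
        # Earlier fields end up in the more significant bits.
        payload = (payload << width) | field
    return payload


def unpack(payload, count):
    """Split a 24-bit payload into `count` equal-width fields."""
    if PAYLOAD_BITS % count != 0:
        raise ValueError("count must divide 24")
    width = PAYLOAD_BITS // count
    mask = (1 << width) - 1
    return [(payload >> (width * i)) & mask for i in reversed(range(count))]


character = pack([0x20AC], 24)           # 1x24: one code point
point = pack([100, 200], 12)             # 2x12: small x and y
color = pack([0xFF, 0x80, 0x00], 8)      # 3x8: red, green, blue
print(unpack(point, 2), unpack(color, 3))
```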

Re: Unique all characters?

Nicolas Cellier-3
Andreas Raab wrote:

> nicolas cellier wrote:
>> But we do not need to reserve two tag bits for every case:
>>
>> xxx..xxx1 is a SmallInteger on 31 bits
>> xxx..xx10 is a Character
>> xxx..xx00 is an OOP
>>
>> This does not put pressure on integer
>
> You may want to read this post:
>
> http://lists.squeakfoundation.org/pipermail/vm-dev/2006-January/000429.html

Yes, thanks.

>
> It outlines a similar approach except that it adds 64 immediate classes
> instead of one and in return reduces the number of available bits to 24
> (which makes for nice 1x24, 2x12, 3x8, 4x6 usage patterns in characters,
> immediate points, short colors etc)
>
> Cheers,
>   - Andreas
>
>