The Inbox: Collections-ul.624.mcz


commits-2

The Inbox: Collections-ul.624.mcz

A new version of Collections was added to project The Inbox:
http://source.squeak.org/inbox/Collections-ul.624.mcz

==================== Summary ====================

Name: Collections-ul.624
Author: ul
Time: 1 May 2015, 1:40:09.445 pm
UUID: a89640e5-e0cc-44e1-9613-e065b78258ad
Ancestors: Collections-tfel.623

Various improvements related to Characters and Strings.

Resurrected Character's ClassificationTable
- initialize it based on the actual EncodedCharSet for leadingChar 0 (Unicode)
- added a bit for #isDigit
- added a mask for #isAlphaNumeric
- use 0 as a tag if the uppercase or lowercase character value is greater than 255
- initialize the table in the postscript (to not interfere with #initialize)
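
The entry layout described in this list can be sketched with plain integer arithmetic. This is only a workspace illustration, not the committed code: the temporaries below are hypothetical stand-ins for Character's class variables, and the codes 97 and 65 for $a and $A are written out directly instead of being asked from an EncodedCharSet.

```smalltalk
"Sketch only: local temporaries stand in for Character's class variables."
| lowercaseBit uppercaseBit digitBit letterMask alphaNumericMask entry |
lowercaseBit := 1 bitShift: 16.
uppercaseBit := 1 bitShift: 17.
digitBit := 1 bitShift: 18.
letterMask := lowercaseBit bitOr: uppercaseBit.
alphaNumericMask := letterMask bitOr: digitBit.

"Entry for $a (code 97): uppercase code 65 in bits 8-15, lowercase code 97 in bits 0-7."
entry := ((65 bitShift: 8) + 97) bitOr: lowercaseBit.

entry bitAnd: 16rFF.                    "lowercase code: 97"
(entry bitShift: -8) bitAnd: 16rFF.     "uppercase code: 65"
(entry bitAnd: lowercaseBit) ~= 0.      "true"
(entry bitAnd: digitBit) ~= 0.          "false"
(entry bitAnd: alphaNumericMask) ~= 0.  "true"
```

Because each test is a single bitAnd: against a precomputed mask, a check such as "letter or digit" needs one table lookup instead of two separate queries.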

Character >> #encodedCharSet assumes that leadingChar 0 means Unicode. This is hardcoded in a few methods, so why not unify it here?
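
A rough sketch of why a comparison against 16r400000 is enough for that shortcut. It assumes that the leading char is stored in the bits above bit 21 of the character value, which is what the constant 16r400000 (2 raised to 22) in the diff suggests; the block name leadingCharOf is hypothetical and only used here.

```smalltalk
"Sketch only: assumes the leading char is stored in the bits above bit 21."
| leadingCharOf |
leadingCharOf := [ :charValue | charValue bitShift: -22 ].

leadingCharOf value: 16r41.                       "0, plain Unicode $A"
leadingCharOf value: 16r3FFFFF.                   "0, largest value taking the shortcut"
leadingCharOf value: 16r400000.                   "1, first value that needs the lookup"
leadingCharOf value: (5 bitShift: 22) + 16r3042.  "5, a tagged code point"
```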

Simpler and faster Character >> #tokenish.
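
As an illustration of the simplified test, the expression below applies the same condition as the new method body to a few characters. It is a workspace sketch that leaves out the $_ case, because that one depends on the Scanner preference.

```smalltalk
"Sketch only: mirrors the non-underscore branch of the new #tokenish."
#($a $Z $7 $: $+) collect: [ :each |
	each -> (each == $: or: [ each isAlphaNumeric ]) ].
```

The letters, the digit and the colon map to true, while $+ maps to false.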

Simpler and faster String >> #withoutLineEndings.
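
A small workspace sketch of the intended behaviour, assuming that #withLineEndings: treats CR, LF and CRLF alike, as the old #withSqueakLineEndings-based version did. With that assumption, both versions should answer 'one two three four' for the string below.

```smalltalk
"Sketch only: mixes CR, LF and CRLF line endings in one string."
| text |
text := 'one', String cr, 'two', String lf, 'three', String crlf, 'four'.
text withoutLineEndings.
```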

Precalculate the size of the result in Symbol >> #numArgs: to avoid an extra allocation.
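
The size calculation can be checked by hand with a sketch like the following. The blocks capacity and needed are hypothetical names used only here: capacity repeats the size expression from the diff, evaluated left to right as Smalltalk does for binary messages, and needed counts the characters actually written by the stream block. Plain integers stand in for self size, self numArgs and n.

```smalltalk
"Sketch only: plain integers stand in for self size, self numArgs and n."
| capacity needed |
capacity := [ :size :numArgs :n | | offset |
	offset := numArgs = 0 ifTrue: [ 1 ] ifFalse: [ 0 ].
	n - numArgs + offset * 5 + offset + size ].
needed := [ :size :numArgs :n | | offset |
	offset := numArgs = 0 ifTrue: [ 1 ] ifFalse: [ 0 ].
	size + offset + (n - numArgs - offset * 5) ].

"#foo: extended to 3 arguments gives #foo:with:with: (14 characters)."
capacity value: 4 value: 1 value: 3.   "14"
needed value: 4 value: 1 value: 3.     "14"

"#foo extended to 3 arguments gives #foo:with:with: as well."
capacity value: 3 value: 0 value: 3.   "24"
needed value: 3 value: 0 value: 3.     "14"
```

For keyword selectors the preallocated size matches exactly. For unary selectors the expression, read this way, looks like an upper bound rather than an exact size; the resulting symbol should be the same either way, since only what was written to the stream ends up in it.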

=============== Diff against Collections-tfel.623 ===============

Item was changed:
Magnitude subclass: #Character
instanceVariableNames: 'value'
+ classVariableNames: 'AlphaNumericMask CharacterTable ClassificationTable DigitBit DigitValues LetterMask LowercaseBit UppercaseBit'
- classVariableNames: 'CharacterTable ClassificationTable DigitValues LetterBits LowercaseBit UppercaseBit'
poolDictionaries: ''
category: 'Collections-Strings'!

!Character commentStamp: 'ar 4/9/2005 22:35' prior: 0!
I represent a character by storing its associated Unicode. The first 256 characters are created uniquely, so that all instances of latin1 characters ($R, for example) are identical.

The code point is based on Unicode. Since Unicode is 21-bit wide character set, we have several bits available for other information. As the Unicode Standard states, a Unicode code point doesn't carry the language information. This is going to be a problem with the languages so called CJK (Chinese, Japanese, Korean. Or often CJKV including Vietnamese). Since the characters of those languages are unified and given the same code point, it is impossible to display a bare Unicode code point in an inspector or such tools. To utilize the extra available bits, we use them for identifying the languages. Since the old implementation uses the bits to identify the character encoding, the bits are sometimes called "encoding tag" or neutrally "leading char", but the bits rigidly denotes the concept of languages.

The other languages can have the language tag if you like. This will help to break the large default font (font set) into separately loadable chunk of fonts. However, it is open to the each native speakers and writers to decide how to define the character equality, since the same Unicode code point may have different language tag thus simple #= comparison may return false.

I represent a character by storing its associated ASCII code (extended to 256 codes). My instances are created uniquely, so that all instances of a character ($R, for example) are identical.!

Item was changed:
----- Method: Character class>>initializeClassificationTable (in category 'class initialization') -----
initializeClassificationTable
+ "Initialize the classification table.
+ The classification table is a compact encoding of upper and lower cases and digits of characters with
+ - bits 0-7: The lower case value of this character or 0, if its greater than 255.
+ - bits 8-15: The upper case value of this character or 0, if its greater than 255.
+ - bit 16: lowercase bit (isLowercase == true)
+ - bit 17: uppercase bit (isUppercase == true)
+ - bit 18: digit bit (isDigit == true)"
+ " self initializeClassificationTable "
- "
- Initialize the classification table. The classification table is a
- compact encoding of upper and lower cases of characters with

+ | encodedCharSet |
+ "Base the table on the EncodedCharset of these characters' leadingChar - 0."
+ encodedCharSet := EncodedCharSet charsetAt: 0.
- - bits 0-7: The lower case value of this character.
- - bits 8-15: The upper case value of this character.
- - bit 16: lowercase bit (e.g., isLowercase == true)
- - bit 17: uppercase bit (e.g., isUppercase == true)

- "
- | ch1 |
-
LowercaseBit := 1 bitShift: 16.
UppercaseBit := 1 bitShift: 17.
+ DigitBit := 1 bitShift: 18.

+ "Initialize the letter mask (e.g., isLetter == true)"
+ LetterMask := LowercaseBit bitOr: UppercaseBit.
- "Initialize the letter bits (e.g., isLetter == true)"
- LetterBits := LowercaseBit bitOr: UppercaseBit.

+ "Initialize the alphanumeric mask (e.g. isAlphaNumeric == true)"
+ AlphaNumericMask := LetterMask bitOr: DigitBit.
- ClassificationTable := Array new: 256.
- "Initialize the defaults (neither lower nor upper case)"
- 0 to: 255 do:[:i|
- ClassificationTable at: i+1 put: (i bitShift: 8) + i.
- ].

+ "Initialize the table based on encodedCharSet."
+ ClassificationTable := Array new: 256.
+ 0 to: 255 do: [ :code |
+ | isLowercase isUppercase isDigit lowercaseCode uppercaseCode value |
+ isLowercase := encodedCharSet isLowercaseCode: code.
+ isUppercase := encodedCharSet isUppercaseCode: code.
+ isDigit := encodedCharSet isDigitCode: code.
+ lowercaseCode := encodedCharSet toLowercaseCode: code.
+ lowercaseCode > 255 ifTrue: [ lowercaseCode := 0 ].
+ uppercaseCode := encodedCharSet toUppercaseCode: code.
+ uppercaseCode > 255 ifTrue: [ uppercaseCode := 0 ].
+ value := (uppercaseCode bitShift: 8) + lowercaseCode.
+ isLowercase ifTrue: [ value := value bitOr: LowercaseBit ].
+ isUppercase ifTrue: [ value := value bitOr: UppercaseBit ].
+ isDigit ifTrue: [ value := value bitOr: DigitBit ].
+ ClassificationTable at: code + 1 put: value ]!
- "Initialize character pairs (upper-lower case)"
- #(
- "Basic roman"
- ($A $a) ($B $b) ($C $c) ($D $d)
- ($E $e) ($F $f) ($G $g) ($H $h)
- ($I $i) ($J $j) ($K $k) ($L $l)
- ($M $m) ($N $n) ($O $o) ($P $p)
- ($Q $q) ($R $r) ($S $s) ($T $t)
- ($U $u) ($V $v) ($W $w) ($X $x)
- ($Y $y) ($Z $z)
- "International"
- ($Ä $ä) ($Å $å) ($Ç $ç) ($É $é)
- ($Ñ $ñ) ($Ö $ö) ($Ü $ü) ($À $à)
- ($Ã $ã) ($Õ $õ) ($ $) ($Æ $æ)
- "International - Spanish"
- ($Á $á) ($Í $í) ($Ó $ó) ($Ú $ú)
- "International - PLEASE CHECK"
- ($È $è) ($Ì $ì) ($Ò $ò) ($Ù $ù)
- ($Ë $ë) ($Ï $ï)
- ($Â $â) ($Ê $ê) ($Î $î) ($Ô $ô) ($Û $û)
- ) do:[:pair| | ch2 |
- ch1 := pair first asciiValue.
- ch2 := pair last asciiValue.
- ClassificationTable at: ch1+1 put: (ch1 bitShift: 8) + ch2 + UppercaseBit.
- ClassificationTable at: ch2+1 put: (ch1 bitShift: 8) + ch2 + LowercaseBit.
- ].
-
- "Initialize a few others for which we only have lower case versions."
- #($ß $Ø $ø $ÿ) do:[:char|
- ch1 := char asciiValue.
- ClassificationTable at: ch1+1 put: (ch1 bitShift: 8) + ch1 + LowercaseBit.
- ].
- !

Item was changed:
----- Method: Character>>encodedCharSet (in category 'accessing') -----
encodedCharSet
+
+ value < 16r400000 ifTrue: [ ^Unicode ]. "Shortcut"
-
^EncodedCharSet charsetAt: self leadingChar
!

Item was changed:
----- Method: Character>>tokenish (in category 'testing') -----
tokenish
+ "Answer whether the receiver is a valid token-character--letter, digit, or colon."
- "Answer whether the receiver is a valid token-character--letter, digit, or
- colon."

+ self == $_ ifTrue: [ ^Scanner prefAllowUnderscoreSelectors ].
+ ^self == $: or: [ self isAlphaNumeric ]!
- ^ self == $_
- ifTrue: [ Scanner prefAllowUnderscoreSelectors ]
- ifFalse: [ self == $: or: [ self isLetter or: [ self isDigit ] ] ]!

Item was changed:
----- Method: String>>withoutLineEndings (in category 'converting') -----
withoutLineEndings

+ ^self withLineEndings: ' '!
- ^ self withSqueakLineEndings
- copyReplaceAll: String cr
- with: ' '
- asTokens: false!

Item was changed:
----- Method: Symbol>>numArgs: (in category 'system primitives') -----
numArgs: n
"Answer a string that can be used as a selector with n arguments.
TODO: need to be extended to support shrinking and for selectors like #+ "

+ | numArgs offset |.
+ (numArgs := self numArgs) >= n ifTrue: [ ^self ].
+ numArgs = 0
+ ifTrue: [ offset := 1 ]
+ ifFalse: [ offset := 0 ].
+ ^(String new: n - numArgs + offset * 5 + offset + self size streamContents: [ :stream |
+ stream nextPutAll: self.
+ numArgs = 0 ifTrue: [ stream nextPut: $:. ].
+ numArgs + offset + 1 to: n do: [ :i | stream nextPutAll: 'with:' ] ]) asSymbol!
- | numArgs aStream offs |.
- (numArgs := self numArgs) >= n ifTrue: [^self].
- aStream := WriteStream on: (String new: 16).
- aStream nextPutAll: self.
-
- (numArgs = 0) ifTrue: [aStream nextPutAll: ':'. offs := 0] ifFalse: [offs := 1].
- 2 to: n - numArgs + offs do: [:i | aStream nextPutAll: 'with:'].
- ^aStream contents asSymbol!

Item was changed:
+ (PackageInfo named: 'Collections') postscript: 'Character initializeClassificationTable'!
- (PackageInfo named: 'Collections') postscript: 'LRUCache allInstances do: [ :each | each reset ]'!

Levente Uzonyi-2

Re: The Inbox: Collections-ul.624.mcz

While I was trying to resurrect Character's ClassificationTable, I found
that Unicode's ToLower and ToUpper tables are initialized incorrectly. I
changed the initialization code in Multilingual-ul.209 (and 208), which
now uses the UnicodeData.txt file instead of CaseFolding.txt.
This means that all #asUppercase and #asLowercase implementations will
behave slightly differently (the worst one was Character's, which was
still based on some LatinX encoding for byte characters).
To review the changes, load Multilingual-ul.209 from the Inbox first, then
merge Collections-ul.624, and finally merge Collections-ul.625.

Levente

marcel.taeumel (old)

Re: The Inbox: Collections-ul.624.mcz

Nice! :)