Smalltalk › Squeak › Squeak - Dev

The Trunk: Collections-ul.800.mcz

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

1 message

commits-2

The Trunk: Collections-ul.800.mcz

Levente Uzonyi uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-ul.800.mcz

==================== Summary ====================

Name: Collections-ul.800
Author: ul
Time: 2 July 2018, 11:44:52.435558 pm
UUID: 369bd2ae-d7b3-40cd-acbd-b731aee61d94
Ancestors: Collections-ul.799

Character changes:
- deprecated #characterTable
- removed the CharacterTable class variable
- store ClassificationTable data in a WordArray (reinitialized by the package postscript)
- use self to create character constants (#euro, #nbsp)

String optimizations for
- #findFirstInString:inSet:startingAt:
- #isAllDigits
- #isOctetString

=============== Diff against Collections-ul.799 ===============

Item was changed:
Magnitude immediateSubclass: #Character
instanceVariableNames: ''
+ classVariableNames: 'AlphaNumericMask ClassificationTable DigitBit DigitValues LetterMask LowercaseBit UppercaseBit'
- classVariableNames: 'AlphaNumericMask CharacterTable ClassificationTable DigitBit DigitValues LetterMask LowercaseBit UppercaseBit'
poolDictionaries: ''
category: 'Collections-Strings'!

!Character commentStamp: 'eem 8/12/2014 14:53' prior: 0!
I represent a character by storing its associated Unicode as an unsigned 30-bit value. Characters are created uniquely, so that all instances of a particular Unicode are identical. My instances are encoded in tagged pointers in the VM, so called immediates, and therefore are pure immutable values.

The code point is based on Unicode. Since Unicode is 21-bit wide character set, we have several bits available for other information. As the Unicode Standard states, a Unicode code point doesn't carry the language information. This is going to be a problem with the languages so called CJK (Chinese, Japanese, Korean. Or often CJKV including Vietnamese). Since the characters of those languages are unified and given the same code point, it is impossible to display a bare Unicode code point in an inspector or such tools. To utilize the extra available bits, we use them for identifying the languages. Since the old implementation uses the bits to identify the character encoding, the bits are sometimes called "encoding tag" or neutrally "leading char", but the bits rigidly denotes the concept of languages.

The other languages can have the language tag if you like. This will help to break the large default font (font set) into separately loadable chunk of fonts. However, it is open to the each native speakers and writers to decide how to define the character equality, since the same Unicode code point may have different language tag thus simple #= comparison may return false.!

Item was removed:
- ----- Method: Character class>>characterTable (in category 'constants') -----
- characterTable
- "Answer the class variable in which unique Characters are stored."
-
- ^CharacterTable!

Item was changed:
----- Method: Character class>>euro (in category 'accessing untypeable characters') -----
euro
"The Euro currency sign, that E with two dashes. The code point is a official unicode ISO/IEC-10646-1"

+ ^self value: 16r20AC!
- ^ Unicode value: 16r20AC!

Item was changed:
----- Method: Character class>>initializeClassificationTable (in category 'class initialization') -----
initializeClassificationTable
"Initialize the classification table.
The classification table is a compact encoding of upper and lower cases and digits of characters with
- bits 0-7: The lower case value of this character or 0, if its greater than 255.
- bits 8-15: The upper case value of this character or 0, if its greater than 255.
- bit 16: lowercase bit (isLowercase == true)
- bit 17: uppercase bit (isUppercase == true)
- bit 18: digit bit (isDigit == true)"
" self initializeClassificationTable "

| encodedCharSet newClassificationTable |
"Base the table on the EncodedCharset of these characters' leadingChar - 0."
encodedCharSet := EncodedCharSet charsetAt: 0.

LowercaseBit := 1 bitShift: 16.
UppercaseBit := 1 bitShift: 17.
DigitBit := 1 bitShift: 18.

"Initialize the letter mask (e.g., isLetter == true)"
LetterMask := LowercaseBit bitOr: UppercaseBit.

"Initialize the alphanumeric mask (e.g. isAlphaNumeric == true)"
AlphaNumericMask := LetterMask bitOr: DigitBit.

"Initialize the table based on encodedCharSet."
+ newClassificationTable := WordArray new: 256.
- newClassificationTable := Array new: 256.
0 to: 255 do: [ :code |
| isLowercase isUppercase isDigit lowercaseCode uppercaseCode value |
isLowercase := encodedCharSet isLowercaseCode: code.
isUppercase := encodedCharSet isUppercaseCode: code.
isDigit := encodedCharSet isDigitCode: code.
lowercaseCode := encodedCharSet toLowercaseCode: code.
lowercaseCode > 255 ifTrue: [ lowercaseCode := 0 ].
uppercaseCode := encodedCharSet toUppercaseCode: code.
uppercaseCode > 255 ifTrue: [ uppercaseCode := 0 ].
value := (uppercaseCode bitShift: 8) + lowercaseCode.
isLowercase ifTrue: [ value := value bitOr: LowercaseBit ].
isUppercase ifTrue: [ value := value bitOr: UppercaseBit ].
isDigit ifTrue: [ value := value bitOr: DigitBit ].
newClassificationTable at: code + 1 put: value ].
ClassificationTable := newClassificationTable!

Item was changed:
----- Method: Character class>>nbsp (in category 'accessing untypeable characters') -----
nbsp
+ "non-breakable space"
- "non-breakable space. Latin1 encoding common usage."

+ ^self value: 160!
- ^ Character value: 160!

Item was changed:
----- Method: String class>>findFirstInString:inSet:startingAt: (in category 'primitives') -----
findFirstInString: aString inSet: inclusionMap startingAt: start
"Trivial, non-primitive version"

+ | i stringSize ascii |
+ inclusionMap size ~= 256 ifTrue: [ ^0 ].
- | i stringSize ascii more |
- inclusionMap size ~= 256 ifTrue: [^ 0].
stringSize := aString size.
- more := true.
i := start - 1.
+ [ (i := i + 1) <= stringSize ] whileTrue: [
+ (ascii := aString basicAt: i) < 256 ifTrue: [
+ (inclusionMap at: ascii + 1) = 0 ifFalse: [ ^i ] ] ].
+ ^0!
- [more and: [(i := i + 1) <= stringSize]] whileTrue: [
- ascii := aString basicAt: i.
- more := ascii < 256 ifTrue: [(inclusionMap at: ascii + 1) = 0] ifFalse: [true].
- ].
-
- i > stringSize ifTrue: [^ 0].
- ^ i!

Item was changed:
----- Method: String>>isAllDigits (in category 'testing') -----
isAllDigits
"whether the receiver is composed entirely of digits"
+
+ ^self allSatisfy: [ :character | character isDigit ]!
- self do: [:c | c isDigit ifFalse: [^ false]].
- ^ true!

Item was changed:
----- Method: String>>isOctetString (in category 'testing') -----
isOctetString
"Answer whether the receiver can be represented as a byte string.
This is different from asking whether the receiver *is* a ByteString
(i.e., #isByteString)"
1 to: self size do: [:pos |
+ (self basicAt: pos) >= 256 ifTrue: [^ false].
- (self at: pos) asInteger >= 256 ifTrue: [^ false].
].
^ true.
!

Item was changed:
+ (PackageInfo named: 'Collections') postscript: 'Character initializeClassificationTable'!
- (PackageInfo named: 'Collections') postscript: 'CharacterSet allInstancesDo: [:e | ByteCharacterSet adoptInstance: e ]'!