The Trunk: Collections-nice.336.mcz

commits-2

Nicolas Cellier uploaded a new version of Collections to project The Trunk:
http://source.squeak.org/trunk/Collections-nice.336.mcz

==================== Summary ====================

Name: Collections-nice.336
Author: nice
Time: 14 March 2010, 11:30:32.668 pm
UUID: 18f05259-ecd0-4bbf-b7ae-ef5319522e2e
Ancestors: Collections-HenrikSperreJohansen.335

1) Cache Character DigitValues to gain some speed.
Warning: lowercase digits are still parsed, although there was no consensus on this.
Note: the class variable initialization test will be removed in the next release. It is only an upgrade guard.
2) Avoid using size == 0 (use size = 0 instead)

Benchmark:
['0123456789' do: [:e | e digitValue]] bench
AFTER '233969.8060387922 per second.' '236418.5162967407 per second.'
BEFORE '188964.4071185763 per second.' '197284.9430113977 per second.'

['0123456789ABCDEF' do: [:e | e digitValue]] bench
AFTER '155123.375324935 per second.' '152030.1939612078 per second.'
BEFORE '120782.4435112977 per second.' '119901.4197160568 per second.'

['0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ' do: [:e | e digitValue]] bench
AFTER '73469.7060587882 per second.' '73144.3711257749 per second.'
BEFORE '55508.49830033993 per second.' '55637.2725454909 per second.'

['0123456789abcdefghijklmnopqrstuvwxyz' do: [:e | e digitValue]] bench
AFTER '71603.8792241552 per second.' '72621.875624875 per second.'
BEFORE '21194.16116776645 per second.' '21273.34533093381 per second.'
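<imports></imports>

For a rough reading of these figures, the workspace sketch below divides each first-run AFTER rate by the matching BEFORE rate. The labels and the rounding to one decimal of the rates are illustrative choices, not part of the commit. Evaluated with "print it", the ratios should come out at roughly 1.2 to 1.3 for the digit, hex and uppercase strings, and at roughly 3.4 for the lowercase string. The larger lowercase gain is consistent with the diff below: the old digitValue searched only '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ', so a lowercase letter missed that lookup and fell through to the EncodedCharSet path.

```smalltalk
"Speedup ratios (AFTER / BEFORE) from the first-run figures quoted above.
 The rates are rounded to one decimal; the labels are illustrative only."
| runs |
runs := {
	'digits' -> (233969.8 / 188964.4).
	'hex' -> (155123.4 / 120782.4).
	'uppercase' -> (73469.7 / 55508.5).
	'lowercase' -> (71603.9 / 21194.2) }.
runs collect: [:each | each key -> (each value roundTo: 0.01)]
```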

=============== Diff against Collections-HenrikSperreJohansen.335 ===============

Item was changed:
----- Method: Character class>>initialize (in category 'class initialization') -----
initialize
+ "Create the table of unique Characters, and DigitsValues."
+ "Character initializeClassificationTable"
+
+ CharacterTable ifNil: [
+ "Initialize only once to ensure that byte characters are unique"
+ CharacterTable := Array new: 256.
+ 1 to: 256 do: [:i | CharacterTable at: i put: (self basicNew setValue: i - 1)]].
+ self initializeDigitValues!
- "Create the table of unique Characters."
- " self initializeClassificationTable"!

Item was changed:
----- Method: String>>asLegalSelector (in category 'converting') -----
asLegalSelector
| toUse |
toUse := self select: [:char | char isAlphaNumeric].
+ (toUse size = 0 or: [toUse first isLetter not])
+ ifTrue: [toUse := 'v', toUse].
- (toUse size == 0 or: [toUse first isLetter not])
- ifTrue: [toUse := 'v', toUse].
^ toUse withFirstCharacterDownshifted!

Item was changed:
----- Method: Character>>digitValue (in category 'accessing') -----
digitValue
"Answer 0-9 if the receiver is $0-$9, 10-35 if it is $A-$Z, and < 0
otherwise. This is used to parse literal numbers of radix 2-36."

+ | code |
+ (code := self charCode) > 16rFF ifTrue: [^(EncodedCharSet charsetAt: self leadingChar) digitValueOf: self].
+ DigitValues ifNil: [self class initializeDigitValues].
+ ^DigitValues at: 1 + code!
- | digitValue |
- (digitValue := ('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ' indexOf: self) - 1) >= 0
- ifTrue: [ ^digitValue ].
- ^ (EncodedCharSet charsetAt: self leadingChar) digitValueOf: self.
- !

Item was changed:
Magnitude subclass: #Character
instanceVariableNames: 'value'
+ classVariableNames: 'CharacterTable ClassificationTable DigitValues LetterBits LowercaseBit UppercaseBit'
- classVariableNames: 'CharacterTable ClassificationTable LetterBits LowercaseBit UppercaseBit'
poolDictionaries: ''
category: 'Collections-Strings'!

!Character commentStamp: 'ar 4/9/2005 22:35' prior: 0!
I represent a character by storing its associated Unicode. The first 256 characters are created uniquely, so that all instances of latin1 characters ($R, for example) are identical.

The code point is based on Unicode. Since Unicode is 21-bit wide character set, we have several bits available for other information. As the Unicode Standard states, a Unicode code point doesn't carry the language information. This is going to be a problem with the languages so called CJK (Chinese, Japanese, Korean. Or often CJKV including Vietnamese). Since the characters of those languages are unified and given the same code point, it is impossible to display a bare Unicode code point in an inspector or such tools. To utilize the extra available bits, we use them for identifying the languages. Since the old implementation uses the bits to identify the character encoding, the bits are sometimes called "encoding tag" or neutrally "leading char", but the bits rigidly denotes the concept of languages.

The other languages can have the language tag if you like. This will help to break the large default font (font set) into separately loadable chunk of fonts. However, it is open to the each native speakers and writers to decide how to define the character equality, since the same Unicode code point may have different language tag thus simple #= comparison may return false.

I represent a character by storing its associated ASCII code (extended to 256 codes). My instances are created uniquely, so that all instances of a character ($R, for example) are identical.!

Item was added:
+ ----- Method: Character class>>initializeDigitValues (in category 'class initialization') -----
+ initializeDigitValues
+ "Initialize the well known digit value of ascii characters.
+ Note that the DigitValues table is 1-based while ascii values are 0-based, thus the offset +1."
+
+ DigitValues := Array new: 256 withAll: -1.
+ "the digits"
+ 0 to: 9 do: [:i | DigitValues at: 48 + i + 1 put: i].
+ "the uppercase letters"
+ 10 to: 35 do: [:i | DigitValues at: 55 + i + 1 put: i].
+ "the lowercase letters"
+ 10 to: 35 do: [:i | DigitValues at: 87 + i + 1 put: i].!
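<imports></imports>

A quick way to check the table after loading this version is to evaluate a few characters in a workspace. This is only a sketch, assuming an image with Collections-nice.336 loaded; the sample string is arbitrary. Going by the offsets in initializeDigitValues, $0 and $9 should answer 0 and 9, $A and $a should both answer 10, $Z and $z should both answer 35, and a non-digit byte character such as $? should answer -1.

```smalltalk
"Workspace sketch: pair each sample character with its digitValue.
 Evaluate with 'print it'. The sample string is an arbitrary choice."
| samples |
samples := '09AZaz?'.
samples asArray collect: [:c | c -> c digitValue]
```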