The Trunk: Multilingual-ul.208.mcz

The Trunk: Multilingual-ul.208.mcz

commits-2
Levente Uzonyi uploaded a new version of Multilingual to project The Trunk:
http://source.squeak.org/trunk/Multilingual-ul.208.mcz

==================== Summary ====================

Name: Multilingual-ul.208
Author: ul
Time: 1 May 2015, 3:25:18.828 pm
UUID: 82d19dac-c602-4c0d-bc9a-7858e3a3c283
Ancestors: Multilingual-ul.206

Improved Unicode caseMappings:
- Don't overwrite an existing mapping, because that leads to problems (like (Unicode toUppercaseCode: $k asciiValue) = 8490)
- Use PluggableDictionary class >> #integerDictionary for better lookup performance (~+16%), and compaction resistance (done at every release).
- Compact the dictionaries before saving.
- Save the new dictionaries atomically.
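
The first point can be illustrated in a workspace. The following is only a sketch of the difference between at:put: and at:ifAbsentPut: when two source codes fold to the same destination, as happens for LATIN CAPITAL LETTER K (16r004B) and KELVIN SIGN (16r212A), which both fold to k (16r006B). It is not the code of the commit, and the variable names (folds, overwriting, preserving) are made up for the illustration.

```smalltalk
"Two case-folding pairs with the same destination code."
| folds overwriting preserving |
folds := { 16r004B -> 16r006B. 16r212A -> 16r006B }.

"at:put: keeps whichever pair is processed last."
overwriting := Dictionary new.
folds do: [ :each | overwriting at: each value put: each key ].

"at:ifAbsentPut: keeps the first pair and ignores later ones."
preserving := Dictionary new.
folds do: [ :each | preserving at: each value ifAbsentPut: [ each key ] ].

"The first element is 8490 (the Kelvin sign), the second is 75 ($K)."
{ overwriting at: 16r006B. preserving at: 16r006B }
```

With at:put: the result depends on the order in which the pairs are visited, which is why an overwritten entry can end up pointing at the Kelvin sign.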

=============== Diff against Multilingual-ul.206 ===============

Item was changed:
  ----- Method: Unicode class>>initializeCaseMappings (in category 'casing') -----
  initializeCaseMappings
        "Unicode initializeCaseMappings"
+
+       UIManager default informUserDuring: [ :bar |
-       ToCasefold := IdentityDictionary new.
-       ToUpper := IdentityDictionary new.
-       ToLower := IdentityDictionary new.
-       UIManager default informUserDuring: [:bar|
                | stream |
                bar value: 'Downloading Unicode data'.
                stream := HTTPClient httpGet: 'http://www.unicode.org/Public/UNIDATA/CaseFolding.txt'.
                (stream isKindOf: RWBinaryOrTextStream) ifFalse:[^self error: 'Download failed'].
                stream reset.
                bar value: 'Updating Case Mappings'.
+               self parseCaseMappingFrom: stream ].!
-               self parseCaseMappingFrom: stream.
-       ].!

Item was changed:
  ----- Method: Unicode class>>parseCaseMappingFrom: (in category 'casing') -----
  parseCaseMappingFrom: stream
        "Parse the Unicode casing mappings from the given stream.
        Handle only the simple mappings"
        "
                Unicode initializeCaseMappings.
        "

+       | newToCasefold newToUpper newToLower casefoldKeys |
+       newToCasefold := PluggableDictionary integerDictionary.
+       newToUpper := PluggableDictionary integerDictionary.
+       newToLower := PluggableDictionary integerDictionary.
-       ToCasefold := IdentityDictionary new: 2048.
-       ToUpper := IdentityDictionary new: 2048.
-       ToLower := IdentityDictionary new: 2048.

+       "Filter the mappings (Simple and Common) to newToCasefold."
+       stream contents linesDo: [ :line |
+               | data fields sourceCode destinationCode |
+               data := line copyUpTo: $#.
+               fields := data findTokens: '; '.
+               (fields size > 2 and: [ #('C' 'S') includes: (fields at: 2) ]) ifTrue:[
+                       sourceCode := Integer readFrom: (fields at: 1) base: 16.
+                       destinationCode := Integer readFrom: (fields at: 3) base: 16.
+                       newToCasefold at: sourceCode put: destinationCode ] ].
-       [stream atEnd] whileFalse:[
-               | fields line srcCode dstCode |
-               line := stream nextLine copyUpTo: $#.
-               fields := line withBlanksTrimmed findTokens: $;.
-               (fields size > 2 and: [#('C' 'S') includes: (fields at: 2) withBlanksTrimmed]) ifTrue:[
-                       srcCode := Integer readFrom: (fields at: 1) withBlanksTrimmed base: 16.
-                       dstCode := Integer readFrom: (fields at: 3) withBlanksTrimmed base: 16.
-                       ToCasefold at: srcCode put: dstCode.
-               ].
-       ].

+       casefoldKeys := newToCasefold keys.
+       newToCasefold keysAndValuesDo: [ :sourceCode :destinationCode |
+               (self isUppercaseCode: sourceCode) ifTrue: [
+                       "In most cases, uppercase letter are folded to lower case"
+                       newToUpper at: destinationCode put: sourceCode.
+                       newToLower at: sourceCode ifAbsentPut: destinationCode "Don't overwrite existing pairs. To avoid $k asUppercase to return the Kelvin character (8490)." ].
+               (self isLowercaseCode: sourceCode) ifTrue: [
+                       "In a few cases, two upper case letters are folded to the same lower case.
+                       We must find an upper case letter folded to the same letter"
+                       casefoldKeys
+                               detect: [ :each |
+                                       (self isUppercaseCode: each) and: [
+                                               (newToCasefold at: each) = destinationCode ] ]
+                               ifFound: [ :uppercaseCode |
+                                       newToUpper at: sourceCode put: uppercaseCode ]
+                               ifNone: [ ] ] ].
+
+       "Compact the dictionaries."
+       newToCasefold compact.
+       newToUpper compact.
+       newToLower compact.
+       "Save in an atomic operation."
+       ToCasefold := newToCasefold.
+       ToUpper := newToUpper.
+       ToLower := newToLower
+       !
-       ToCasefold keysAndValuesDo:
-               [:k :v |
-               (self isUppercaseCode: k)
-                       ifTrue:
-                               ["In most cases, uppercase letter are folded to lower case"
-                               ToUpper at: v put: k.
-                               ToLower at: k put: v].
-               (self isLowercaseCode: k)
-                       ifTrue:
-                               ["In a few cases, two upper case letters are folded to the same lower case.
-                               We must find an upper case letter folded to the same letter"
-                               | up |
-                               up := ToCasefold keys detect: [:e | (self isUppercaseCode: e) and: [(ToCasefold at: e) = v]] ifNone: [nil].
-                               up ifNotNil: [ToUpper at: k put: up]]].!



Re: The Trunk: Multilingual-ul.208.mcz

Nicolas Cellier
Ouch, yes, extracting simple case mapping from full CaseFolding data was probably a mistake...
Thanks for reviewing, and as we say, vieux motard que jamais (better late than never) - it's almost 5 years old

The next job will be to comment the Unicode class and explain which Unicode operations are supported...

--------------------------

Multilingual-nice.123
Author: nice
Time: 14 July 2010, 1:17:02.219 pm
UUID: ec8f05b8-78a6-4496-aca9-8f9b2e54823d
Ancestors: Multilingual-ul.122

1) simplify a case of at:ifAbsentPut: pattern in SparseXTable
2) provide a simple mapping of unicode upper/lower case characters as described at http://unicode.org/reports/tr21/tr21-5.html

Note 1: Unicode class now provides two utilities to transform case of a String rather than of a Character. This is for enabling future enhancements like handling special casings when case folding does change the number of characters.

Note 2: there is no automatic initialization performed yet. You'll have to execute this before using above utilities:
Unicode initializeCaseMappings.

This is only an unoptimized, first attempt proposal. Comments and changes are of course welcome.
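
The manual initialization from Note 2 can be tried in a workspace, roughly as below. This is a minimal sketch that uses only the two selectors named in this thread (initializeCaseMappings and toUppercaseCode:). It assumes the image has network access, because the mappings are downloaded from unicode.org, and the result of the lookup depends on which version of Multilingual is loaded.

```smalltalk
"Download CaseFolding.txt and rebuild the case mapping dictionaries."
Unicode initializeCaseMappings.

"Look up the simple uppercase mapping of $k by code point."
Unicode toUppercaseCode: $k asciiValue
```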
