Levente Uzonyi uploaded a new version of Multilingual to project The Trunk:
http://source.squeak.org/trunk/Multilingual-ul.208.mcz

==================== Summary ====================

Name: Multilingual-ul.208
Author: ul
Time: 1 May 2015, 3:25:18.828 pm
UUID: 82d19dac-c602-4c0d-bc9a-7858e3a3c283
Ancestors: Multilingual-ul.206

Improved Unicode caseMappings:
- Don't overwrite an existing mapping, because that leads to problems (like (Unicode toUppercaseCode: $k asciiValue) = 8490)
- Use PluggableDictionary class >> #integerDictionary for better lookup performance (~+16%), and compaction resistance (done at every release).
- Compact the dictionaries before saving.
- Save the new dictionaries atomically.

=============== Diff against Multilingual-ul.206 ===============

Item was changed:
  ----- Method: Unicode class>>initializeCaseMappings (in category 'casing') -----
  initializeCaseMappings
  	"Unicode initializeCaseMappings"
+ 
+ 	UIManager default informUserDuring: [ :bar |
- 	ToCasefold := IdentityDictionary new.
- 	ToUpper := IdentityDictionary new.
- 	ToLower := IdentityDictionary new.
- 	UIManager default informUserDuring: [:bar|
  		| stream |
  		bar value: 'Downloading Unicode data'.
  		stream := HTTPClient httpGet: 'http://www.unicode.org/Public/UNIDATA/CaseFolding.txt'.
  		(stream isKindOf: RWBinaryOrTextStream) ifFalse:[^self error: 'Download failed'].
  		stream reset.
  		bar value: 'Updating Case Mappings'.
+ 		self parseCaseMappingFrom: stream ].!
- 		self parseCaseMappingFrom: stream.
- 	].!

Item was changed:
  ----- Method: Unicode class>>parseCaseMappingFrom: (in category 'casing') -----
  parseCaseMappingFrom: stream
  	"Parse the Unicode casing mappings from the given stream.
  	Handle only the simple mappings"
  	"
  		Unicode initializeCaseMappings.
  	"
+ 	| newToCasefold newToUpper newToLower casefoldKeys |
+ 	newToCasefold := PluggableDictionary integerDictionary.
+ 	newToUpper := PluggableDictionary integerDictionary.
+ 	newToLower := PluggableDictionary integerDictionary.
- 	ToCasefold := IdentityDictionary new: 2048.
- 	ToUpper := IdentityDictionary new: 2048.
- 	ToLower := IdentityDictionary new: 2048.
+ 	"Filter the mappings (Simple and Common) to newToCasefold."
+ 	stream contents linesDo: [ :line |
+ 		| data fields sourceCode destinationCode |
+ 		data := line copyUpTo: $#.
+ 		fields := data findTokens: '; '.
+ 		(fields size > 2 and: [ #('C' 'S') includes: (fields at: 2) ]) ifTrue:[
+ 			sourceCode := Integer readFrom: (fields at: 1) base: 16.
+ 			destinationCode := Integer readFrom: (fields at: 3) base: 16.
+ 			newToCasefold at: sourceCode put: destinationCode ] ].
- 	[stream atEnd] whileFalse:[
- 		| fields line srcCode dstCode |
- 		line := stream nextLine copyUpTo: $#.
- 		fields := line withBlanksTrimmed findTokens: $;.
- 		(fields size > 2 and: [#('C' 'S') includes: (fields at: 2) withBlanksTrimmed]) ifTrue:[
- 			srcCode := Integer readFrom: (fields at: 1) withBlanksTrimmed base: 16.
- 			dstCode := Integer readFrom: (fields at: 3) withBlanksTrimmed base: 16.
- 			ToCasefold at: srcCode put: dstCode.
- 		].
- 	].
+ 	casefoldKeys := newToCasefold keys.
+ 	newToCasefold keysAndValuesDo: [ :sourceCode :destinationCode |
+ 		(self isUppercaseCode: sourceCode) ifTrue: [
+ 			"In most cases, uppercase letter are folded to lower case"
+ 			newToUpper at: destinationCode put: sourceCode.
+ 			newToLower at: sourceCode ifAbsentPut: destinationCode "Don't overwrite existing pairs. To avoid $k asUppercase to return the Kelvin character (8490)." ].
+ 		(self isLowercaseCode: sourceCode) ifTrue: [
+ 			"In a few cases, two upper case letters are folded to the same lower case.
+ 			We must find an upper case letter folded to the same letter"
+ 			casefoldKeys
+ 				detect: [ :each |
+ 					(self isUppercaseCode: each) and: [
+ 						(newToCasefold at: each) = destinationCode ] ]
+ 				ifFound: [ :uppercaseCode |
+ 					newToUpper at: sourceCode put: uppercaseCode ]
+ 				ifNone: [ ] ] ].
+ 
+ 	"Compact the dictionaries."
+ 	newToCasefold compact.
+ 	newToUpper compact.
+ 	newToLower compact.
+ 	"Save in an atomic operation."
+ 	ToCasefold := newToCasefold.
+ 	ToUpper := newToUpper.
+ 	ToLower := newToLower
+ !
- 	ToCasefold keysAndValuesDo:
- 		[:k :v |
- 		(self isUppercaseCode: k)
- 			ifTrue:
- 				["In most cases, uppercase letter are folded to lower case"
- 				ToUpper at: v put: k.
- 				ToLower at: k put: v].
- 		(self isLowercaseCode: k)
- 			ifTrue:
- 				["In a few cases, two upper case letters are folded to the same lower case.
- 				We must find an upper case letter folded to the same letter"
- 				| up |
- 				up := ToCasefold keys detect: [:e | (self isUppercaseCode: e) and: [(ToCasefold at: e) = v]] ifNone: [nil].
- 				up ifNotNil: [ToUpper at: k put: up]]].!
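The overwrite problem named in the summary can be reproduced in a workspace without downloading anything. What follows is only a sketch of the failure mode, not the committed method: the names sampleLines, overwriting and guarded are hypothetical, and the two sample lines are written in the CaseFolding.txt format for LATIN CAPITAL LETTER K and KELVIN SIGN, which both fold to $k (16r6B). In the real method the visiting order comes from the dictionary rather than from the file, so which mapping wins there may differ.

```smalltalk
"Workspace sketch. sampleLines, overwriting and guarded are hypothetical names."
| sampleLines overwriting guarded |
sampleLines := #(
	'004B; C; 006B; # LATIN CAPITAL LETTER K'
	'212A; C; 006B; # KELVIN SIGN').
overwriting := Dictionary new.
guarded := Dictionary new.
sampleLines do: [ :line |
	| fields sourceCode destinationCode |
	"Same tokenizing as in the diff: drop the comment, split on ';' and space."
	fields := (line copyUpTo: $#) findTokens: '; '.
	sourceCode := Integer readFrom: (fields at: 1) base: 16.
	destinationCode := Integer readFrom: (fields at: 3) base: 16.
	"Unconditional store: the last source code visited wins."
	overwriting at: destinationCode put: sourceCode.
	"Guarded store: the first mapping seen is kept."
	guarded at: destinationCode ifAbsentPut: [ sourceCode ] ].
overwriting at: 16r6B. "8490, the Kelvin sign"
guarded at: 16r6B. "75, i.e. $K asciiValue"
```

Evaluating the last two expressions with "print it" shows the difference: the unconditional store maps $k to the Kelvin sign, while the guarded store keeps the plain $K.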
Ouch, yes, extracting the simple case mapping from the full CaseFolding data was probably a mistake...

Thanks for reviewing, and as we say, vieux motard que jamais (better late than never): it's almost 5 years old.

The next job will be to comment the Unicode class and explain which Unicode operations are supported...

--------------------------
Multilingual-nice.123
Author: nice
Time: 14 July 2010, 1:17:02.219 pm
UUID: ec8f05b8-78a6-4496-aca9-8f9b2e54823d
Ancestors: Multilingual-ul.122

1) simplify a case of at:ifAbsentPut: pattern in SparseXTable
2) provide a simple mapping of unicode upper/lower case characters as described at http://unicode.org/reports/tr21/tr21-5.html

Note 1: Unicode class now provides two utilities to transform case of a String rather than of a Character. This is for enabling future enhancements like handling special casings when case folding does change the number of characters.

Note 2: there is no automatic initialization performed yet. You'll have to execute this before using above utilities:
Unicode initializeCaseMappings.

This is only an unoptimized, first attempt proposal. Comments and changes are of course welcome.

2015-05-03 2:15 GMT+02:00 <[hidden email]>:
Levente Uzonyi uploaded a new version of Multilingual to project The Trunk: