A new version of Collections was added to project The Inbox:
http://source.squeak.org/inbox/Collections-mt.839.mcz ==================== Summary ==================== Name: Collections-mt.839 Author: mt Time: 5 July 2019, 4:54:03.194089 pm UUID: 76695fa3-7ed2-cb4e-8c6e-b33c83dd526f Ancestors: Collections-mt.838 Improves code commentary, structure, and performance. Thanks Levente for the tips! (Adds tests for #findFeatures in CollectionsTests-mt.313). =============== Diff against Collections-mt.838 =============== Item was changed: ----- Method: String>>findFeatureIndicesDo: (in category 'accessing - features') ----- findFeatureIndicesDo: aBlock + "Support for simple analysis of natural language in source code. - "State machine that separates camelCase, UPPERCase, number/operator combinations and skips colons" - | last state char "0 = start, 1 = a, 2 = A, 3 = AA, 4 = num, 5 = op" | + In addition to whitespace separation like #findTokens:, also separate features using higher-level rules: + (1) 'camelCase' -> #('camel' 'Case'), + (2) 'UPPERCase' -> #('UPPER' 'Case'), + (3) integer numbers such as 'MyModel55' -> #('My' 'Model' '55'), and + (4) operators such as '5 <= 4' -> #('5' '<=' '4'). + Other kinds of characters are tokenized as operators: '[state := 2]' -> #('[' 'state' ':=' '2' ']'). - state := 0. - last := 1. + This method works like #lineIndicesDo: and provides start/stop indices of tokens to the given aBlock to, for example, extract and normalize features (or tokens)." + + | last state char | + state := 0. "0 = start, 1 = a, 2 = A, 3 = AA, 4 = num, 5 = op" + last := 1. "last character index" + 1 to: self size do: [ :index | char := self at: index. + + char isLowercase + ifTrue: [ "a" + state < 3 ifTrue: [state := 1]. "*a -> a" + state = 3 ifTrue: [ + "AAa -> A + Aa (camel case follows uppercase)" + aBlock value: last value: index - 2. + last := index - 1. + state := 2]. + state > 3 ifTrue: [ + "+a -> + | a (letter follows non-letter)" + aBlock value: last value: index - 1. + last := index. + state := 1]] - "a" - char isLowercase ifTrue: [ - (state < 3) ifTrue: [state := 1]. "*a -> a" - (state == 3) ifTrue: [ - "AAa -> A + Aa (camel case follows uppercase)" - aBlock value: last value: index - 2. - last := index - 1. - state := 2]. - (state > 3) ifTrue: [ - "+a -> + | a (letter follows non-letter)" - aBlock value: last value: index - 1. - last := index. - state := 1]] - ifFalse: [ - char isUppercase ifTrue: [ - (state == 0) - ifTrue: [state := 2] "start -> A" - ifFalse: [ - (state < 2 or: [state > 3]) ifTrue: [ - "*A -> * | A (uppercase begins, flush before)" - aBlock value: last value: index - 1. - last := index. - state := 2] ifFalse: [ - "AA -> AA (uppercase continues)" - state := 3]]] ifFalse: [ + char isUppercase + ifTrue: [ "A" + state = 0 + ifTrue: [state := 2] "start -> A" - ("char == $: or:" char isSeparator) ifTrue: [ - "skip colon/whitespace" - (state > 0) ifTrue: [ - aBlock value: last value: index - 1. - state := 0]. - last := index + 1] - ifFalse: [ - char isDigit ifTrue: [ - (state == 0) - ifTrue: [state := 4] ifFalse: [ + (state < 2) | (state > 3) ifTrue: [ + "*A -> * | A (uppercase begins, flush before)" + aBlock value: last value: index - 1. + last := index. + state := 2] ifFalse: [ + "AA -> AA (uppercase continues)" + state := 3]]] + ifFalse: [ + char isSeparator + ifTrue: [ " " + "skip whitespace" + state > 0 ifTrue: [ + aBlock value: last value: index - 1. + state := 0]. + last := index + 1] + ifFalse: [ + + char isDigit + ifTrue: [ "num" + state = 0 + ifTrue: [state := 4] + ifFalse: [ + state ~= 4 ifTrue: [ + aBlock value: last value: index - 1. + last := index. + state := 4]]] + ifFalse: [ "op" + state = 0 + ifTrue: [state := 5] + ifFalse: [ + state < 5 ifTrue: [ + aBlock value: last value: index - 1. + last := index. + state := 5]] ] ] ] ] ]. - (state ~= 4) ifTrue: [ - aBlock value: last value: index - 1. - last := index. - state := 4]]] - ifFalse: [ - (state == 0) - ifTrue: [state := 5] - ifFalse: [ - (state < 5) ifTrue: [ - aBlock value: last value: index - 1. - last := index. - state := 5]]]]]]]. last <= self size ifTrue: [ aBlock value: last value: self size]! Item was changed: ----- Method: String>>findFeaturesDo: (in category 'accessing - features') ----- findFeaturesDo: aBlock + "Simple analysis of natural language in source code. Select all features that are letters only, normalize them as lowercase. No support for word stemming. + + Example: + 'Transcript show: 123 asString; cr; show: #HelloWorld.' + -> #('transcript' 'show' 'as' 'string' 'cr' 'show' 'hello' 'world') + " - "Simple analysis for natural language in source code. No support for word stemming." self findFeatureIndicesDo: [:start :end | (self at: start) isLetter ifTrue: [ aBlock value: (self copyFrom: start to: end) asLowercase]].! Item was added: + ----- Method: String>>findTokens (in category 'accessing - tokens') ----- + findTokens + + ^ self findTokens: Character separators! Item was changed: ----- Method: String>>findTokens: (in category 'accessing') ----- findTokens: delimiters + "Answer the collection of tokens between delimiters, which results from parsing self." - "Answer the collection of tokens that result from parsing self." + | tokens | + tokens := OrderedCollection new. + self + findTokens: delimiters + do: [:token | tokens addLast: token]. + ^ tokens! - ^ OrderedCollection streamContents: [:tokens | - self - findTokens: delimiters - do: [:token | tokens nextPut: token]]! Item was changed: + ----- Method: String>>findTokens:do: (in category 'accessing - tokens') ----- - ----- Method: String>>findTokens:do: (in category 'accessing') ----- findTokens: delimiters do: aBlock self findTokens: delimiters indicesDo: [:start :end | aBlock value: (self copyFrom: start to: end)].! Item was changed: + ----- Method: String>>findTokens:indicesDo: (in category 'accessing - tokens') ----- + findTokens: oneOrMoreCharacters indicesDo: aBlock + "Parse self to find tokens between delimiters. Any character in the Collection delimiters marks a border. Several delimiters in a row are considered as just one separation. The interface is similar to #lineIndicesDo:." - ----- Method: String>>findTokens:indicesDo: (in category 'accessing') ----- - findTokens: delimiters indicesDo: aBlock - "Parse self to find tokens between delimiters. Any character in the Collection delimiters marks a border. Several delimiters in a row are considered as just one separation. Also, allow delimiters to be a single character. Similar to #lineIndicesDo:." + | keyStart keyStop separators size | + size := self size. + separators := oneOrMoreCharacters isCharacter + ifTrue: [{oneOrMoreCharacters}] + ifFalse: [oneOrMoreCharacters]. - | tokens keyStart keyStop separators | - separators := delimiters isCharacter - ifTrue: [Array with: delimiters] - ifFalse: [delimiters]. keyStop := 1. + [keyStop <= size] whileTrue: [ - [keyStop <= self size] whileTrue: [ keyStart := self skipDelimiters: separators startingAt: keyStop. keyStop := self findDelimiters: separators startingAt: keyStart. keyStart < keyStop ifTrue: [aBlock value: keyStart value: keyStop - 1]].! |
Free forum by Nabble | Edit this page |