A new version of Collections was added to project The Inbox:
http://source.squeak.org/inbox/Collections-mt.838.mcz ==================== Summary ==================== Name: Collections-mt.838 Author: mt Time: 4 July 2019, 4:32:40.854026 pm UUID: ac4ab442-79c0-d246-8dec-914be7ee5356 Ancestors: Collections-pre.837 To String, adds simple analysis of natural language in source code. No word stemming. 1) Refactor #findTokens: to look like #lines (i.e. #linesDo: and #lineIndicesDo:). 2) Add #findFeaturesDo: like #findTokens:do: and #linesDo:. Try this: HTTPDownloadRequest name findFeatures. (Morph >> #drawOn:) getSource asString findFeatures. Where can that be useful? - Automatic insertion of "*" for search terms like "WeakDictionary" to also find WeakIdentityDictionary etc. - Prefix emphasis for names lists of classes in code browsers: MCAddition, MCAncestry, etc. =============== Diff against Collections-pre.837 =============== Item was added: + ----- Method: String>>findFeatureIndicesDo: (in category 'accessing - features') ----- + findFeatureIndicesDo: aBlock + "State machine that separates camelCase, UPPERCase, number/operator combinations and skips colons" + | last state char "0 = start, 1 = a, 2 = A, 3 = AA, 4 = num, 5 = op" | + + state := 0. + last := 1. + + 1 to: self size do: [ :index | + char := self at: index. + "a" + char isLowercase ifTrue: [ + (state < 3) ifTrue: [state := 1]. "*a -> a" + (state == 3) ifTrue: [ + "AAa -> A + Aa (camel case follows uppercase)" + aBlock value: last value: index - 2. + last := index - 1. + state := 2]. + (state > 3) ifTrue: [ + "+a -> + | a (letter follows non-letter)" + aBlock value: last value: index - 1. + last := index. + state := 1]] + ifFalse: [ + char isUppercase ifTrue: [ + (state == 0) + ifTrue: [state := 2] "start -> A" + ifFalse: [ + (state < 2 or: [state > 3]) ifTrue: [ + "*A -> * | A (uppercase begins, flush before)" + aBlock value: last value: index - 1. + last := index. + state := 2] ifFalse: [ + "AA -> AA (uppercase continues)" + state := 3]]] + ifFalse: [ + ("char == $: or:" char isSeparator) ifTrue: [ + "skip colon/whitespace" + (state > 0) ifTrue: [ + aBlock value: last value: index - 1. + state := 0]. + last := index + 1] + ifFalse: [ + char isDigit ifTrue: [ + (state == 0) + ifTrue: [state := 4] + ifFalse: [ + (state ~= 4) ifTrue: [ + aBlock value: last value: index - 1. + last := index. + state := 4]]] + ifFalse: [ + (state == 0) + ifTrue: [state := 5] + ifFalse: [ + (state < 5) ifTrue: [ + aBlock value: last value: index - 1. + last := index. + state := 5]]]]]]]. + last <= self size ifTrue: [ + aBlock value: last value: self size]! Item was added: + ----- Method: String>>findFeatures (in category 'accessing - features') ----- + findFeatures + + ^ Array streamContents: [:features | + self findFeaturesDo: [:feature | features nextPut: feature]]! Item was added: + ----- Method: String>>findFeaturesDo: (in category 'accessing - features') ----- + findFeaturesDo: aBlock + "Simple analysis for natural language in source code. No support for word stemming." + + self findFeatureIndicesDo: [:start :end | + (self at: start) isLetter ifTrue: [ + aBlock value: (self copyFrom: start to: end) asLowercase]].! Item was changed: ----- Method: String>>findTokens: (in category 'accessing') ----- findTokens: delimiters + "Answer the collection of tokens that result from parsing self." + + ^ OrderedCollection streamContents: [:tokens | + self + findTokens: delimiters + do: [:token | tokens nextPut: token]]! - "Answer the collection of tokens that result from parsing self. Return strings between the delimiters. Any character in the Collection delimiters marks a border. Several delimiters in a row are considered as just one separation. Also, allow delimiters to be a single character." - - | tokens keyStart keyStop separators | - - tokens := OrderedCollection new. - separators := delimiters isCharacter - ifTrue: [Array with: delimiters] - ifFalse: [delimiters]. - keyStop := 1. - [keyStop <= self size] whileTrue: - [keyStart := self skipDelimiters: separators startingAt: keyStop. - keyStop := self findDelimiters: separators startingAt: keyStart. - keyStart < keyStop - ifTrue: [tokens add: (self copyFrom: keyStart to: (keyStop - 1))]]. - ^tokens! Item was added: + ----- Method: String>>findTokens:do: (in category 'accessing') ----- + findTokens: delimiters do: aBlock + + self + findTokens: delimiters + indicesDo: [:start :end | aBlock value: (self copyFrom: start to: end)].! Item was added: + ----- Method: String>>findTokens:indicesDo: (in category 'accessing') ----- + findTokens: delimiters indicesDo: aBlock + "Parse self to find tokens between delimiters. Any character in the Collection delimiters marks a border. Several delimiters in a row are considered as just one separation. Also, allow delimiters to be a single character. Similar to #lineIndicesDo:." + + | tokens keyStart keyStop separators | + separators := delimiters isCharacter + ifTrue: [Array with: delimiters] + ifFalse: [delimiters]. + keyStop := 1. + [keyStop <= self size] whileTrue: [ + keyStart := self skipDelimiters: separators startingAt: keyStop. + keyStop := self findDelimiters: separators startingAt: keyStart. + keyStart < keyStop + ifTrue: [aBlock value: keyStart value: keyStop - 1]].! |
Credits go to Toni Mattis (https://github.com/amintos) for the idea and implementation! :-) Thanks!
|
In reply to this post by commits-2
On Thu, 4 Jul 2019, [hidden email] wrote:
> A new version of Collections was added to project The Inbox: > http://source.squeak.org/inbox/Collections-mt.838.mcz > > ==================== Summary ==================== > > Name: Collections-mt.838 > Author: mt > Time: 4 July 2019, 4:32:40.854026 pm > UUID: ac4ab442-79c0-d246-8dec-914be7ee5356 > Ancestors: Collections-pre.837 > > To String, adds simple analysis of natural language in source code. No word stemming. > > 1) Refactor #findTokens: to look like #lines (i.e. #linesDo: and #lineIndicesDo:). > 2) Add #findFeaturesDo: like #findTokens:do: and #linesDo:. > > Try this: > > HTTPDownloadRequest name findFeatures. > (Morph >> #drawOn:) getSource asString findFeatures. > > Where can that be useful? > > - Automatic insertion of "*" for search terms like "WeakDictionary" to also find WeakIdentityDictionary etc. > - Prefix emphasis for names lists of classes in code browsers: MCAddition, MCAncestry, etc. Given the new methods' completixy, I think they deserve tests. > > =============== Diff against Collections-pre.837 =============== > > Item was added: > + ----- Method: String>>findFeatureIndicesDo: (in category 'accessing - features') ----- > + findFeatureIndicesDo: aBlock > + "State machine that separates camelCase, UPPERCase, number/operator combinations and skips colons" I think an example would help make it is easier to understand what this method does. (The same applies to #findTokens:, but I'm already familiar with that.) > + | last state char "0 = start, 1 = a, 2 = A, 3 = AA, 4 = num, 5 = op" | > + > + state := 0. > + last := 1. > + > + 1 to: self size do: [ :index | > + char := self at: index. > + "a" > + char isLowercase ifTrue: [ > + (state < 3) ifTrue: [state := 1]. "*a -> a" > + (state == 3) ifTrue: [ #= is optimized just as good as #== when the argument is a constant. Using #= and dropping the unnecessary parentheses would make the code look a bit less "C-style". > + "AAa -> A + Aa (camel case follows uppercase)" > + aBlock value: last value: index - 2. > + last := index - 1. > + state := 2]. > + (state > 3) ifTrue: [ > + "+a -> + | a (letter follows non-letter)" > + aBlock value: last value: index - 1. > + last := index. > + state := 1]] > + ifFalse: [ > + char isUppercase ifTrue: [ > + (state == 0) > + ifTrue: [state := 2] "start -> A" > + ifFalse: [ > + (state < 2 or: [state > 3]) ifTrue: [ > + "*A -> * | A (uppercase begins, flush before)" > + aBlock value: last value: index - 1. > + last := index. > + state := 2] ifFalse: [ > + "AA -> AA (uppercase continues)" > + state := 3]]] > + ifFalse: [ > + ("char == $: or:" char isSeparator) ifTrue: [ > + "skip colon/whitespace" > + (state > 0) ifTrue: [ > + aBlock value: last value: index - 1. > + state := 0]. > + last := index + 1] > + ifFalse: [ > + char isDigit ifTrue: [ > + (state == 0) > + ifTrue: [state := 4] > + ifFalse: [ > + (state ~= 4) ifTrue: [ > + aBlock value: last value: index - 1. > + last := index. > + state := 4]]] > + ifFalse: [ > + (state == 0) > + ifTrue: [state := 5] > + ifFalse: [ > + (state < 5) ifTrue: [ > + aBlock value: last value: index - 1. > + last := index. > + state := 5]]]]]]]. > + last <= self size ifTrue: [ > + aBlock value: last value: self size]! > > Item was added: > + ----- Method: String>>findFeatures (in category 'accessing - features') ----- > + findFeatures > + > + ^ Array streamContents: [:features | > + self findFeaturesDo: [:feature | features nextPut: feature]]! > > Item was added: > + ----- Method: String>>findFeaturesDo: (in category 'accessing - features') ----- > + findFeaturesDo: aBlock > + "Simple analysis for natural language in source code. No support for word stemming." > + > + self findFeatureIndicesDo: [:start :end | > + (self at: start) isLetter ifTrue: [ > + aBlock value: (self copyFrom: start to: end) asLowercase]].! > > Item was changed: > ----- Method: String>>findTokens: (in category 'accessing') ----- > findTokens: delimiters > + "Answer the collection of tokens that result from parsing self." > + > + ^ OrderedCollection streamContents: [:tokens | #streamContents: should never be used with OrderedCollection. OrderedCollection has its own streaming API (I would use #addLast: here) which is way more efficient. > + self > + findTokens: delimiters > + do: [:token | tokens nextPut: token]]! > - "Answer the collection of tokens that result from parsing self. Return strings between the delimiters. Any character in the Collection delimiters marks a border. Several delimiters in a row are considered as just one separation. Also, allow delimiters to be a single character." > - > - | tokens keyStart keyStop separators | > - > - tokens := OrderedCollection new. > - separators := delimiters isCharacter > - ifTrue: [Array with: delimiters] > - ifFalse: [delimiters]. > - keyStop := 1. > - [keyStop <= self size] whileTrue: > - [keyStart := self skipDelimiters: separators startingAt: keyStop. > - keyStop := self findDelimiters: separators startingAt: keyStart. > - keyStart < keyStop > - ifTrue: [tokens add: (self copyFrom: keyStart to: (keyStop - 1))]]. > - ^tokens! > > Item was added: > + ----- Method: String>>findTokens:do: (in category 'accessing') ----- > + findTokens: delimiters do: aBlock > + > + self > + findTokens: delimiters > + indicesDo: [:start :end | aBlock value: (self copyFrom: start to: end)].! > > Item was added: > + ----- Method: String>>findTokens:indicesDo: (in category 'accessing') ----- > + findTokens: delimiters indicesDo: aBlock > + "Parse self to find tokens between delimiters. Any character in the Collection delimiters marks a border. Several delimiters in a row are considered as just one separation. Also, allow delimiters to be a single character. Similar to #lineIndicesDo:." > + > + | tokens keyStart keyStop separators | There are a few opportunities to regain the performance lost with the introduction of blocks and sends: - the tokens temporary is unused - self size should be cached in a temporary (size) - instead of Array >> #with:, a brace array should be used - | keyStop keyStart separators size | would probably yield the best performance Levente > + separators := delimiters isCharacter > + ifTrue: [Array with: delimiters] > + ifFalse: [delimiters]. > + keyStop := 1. > + [keyStop <= self size] whileTrue: [ > + keyStart := self skipDelimiters: separators startingAt: keyStop. > + keyStop := self findDelimiters: separators startingAt: keyStart. > + keyStart < keyStop > + ifTrue: [aBlock value: keyStart value: keyStop - 1]].! |
Hi Levente, thanks for the tips! I would appreciate another review of Collections-mt.839 and CollectionsTests-mt.313. :-) Best, Marcel
|
Free forum by Nabble | Edit this page |