The Inbox: Collections-mt.839.mcz

commits-2
A new version of Collections was added to project The Inbox:
http://source.squeak.org/inbox/Collections-mt.839.mcz

==================== Summary ====================

Name: Collections-mt.839
Author: mt
Time: 5 July 2019, 4:54:03.194089 pm
UUID: 76695fa3-7ed2-cb4e-8c6e-b33c83dd526f
Ancestors: Collections-mt.838

Improves code commentary, structure, and performance. Thanks Levente for the tips!

(Adds tests for #findFeatures in CollectionsTests-mt.313).

=============== Diff against Collections-mt.838 ===============

Item was changed:
  ----- Method: String>>findFeatureIndicesDo: (in category 'accessing - features') -----
  findFeatureIndicesDo: aBlock
+ "Support for simple analysis of natural language in source code.
- "State machine that separates camelCase, UPPERCase, number/operator combinations and skips colons"
- | last state char "0 = start, 1 = a, 2 = A, 3 = AA, 4 = num, 5 = op"  |
 
+ In addition to whitespace separation like #findTokens:, also separate features using higher-level rules:
+ (1) 'camelCase' -> #('camel' 'Case'),
+ (2) 'UPPERCase' -> #('UPPER' 'Case'),
+ (3) integer numbers such as 'MyModel55' -> #('My' 'Model' '55'), and
+ (4) operators such as '5 <= 4' -> #('5' '<=' '4').
+ Other kinds of characters are tokenized as operators: '[state := 2]' -> #('[' 'state' ':=' '2' ']').
- state := 0.
- last := 1.
 
+ This method works like #lineIndicesDo: and provides start/stop indices of tokens to the given aBlock to, for example, extract and normalize features (or tokens)."
+
+ | last state char |
+ state := 0. "0 = start, 1 = a, 2 = A, 3 = AA, 4 = num, 5 = op"
+ last := 1. "last character index"
+
  1 to: self size do: [ :index |
  char := self at: index.
+
+ char isLowercase
+ ifTrue: [ "a"
+ state < 3 ifTrue: [state := 1]. "*a -> a"
+ state = 3 ifTrue: [
+ "AAa -> A + Aa (camel case follows uppercase)"
+ aBlock value: last value: index - 2.
+ last := index - 1.
+ state := 2].
+ state > 3 ifTrue: [
+ "+a -> + | a (letter follows non-letter)"
+ aBlock value: last value: index - 1.
+ last := index.
+ state := 1]]
- "a"
- char isLowercase ifTrue: [
- (state < 3) ifTrue: [state := 1]. "*a -> a"
- (state == 3) ifTrue: [
- "AAa -> A + Aa (camel case follows uppercase)"
- aBlock value: last value: index - 2.
- last := index - 1.
- state := 2].
- (state > 3) ifTrue: [
- "+a -> + | a (letter follows non-letter)"
- aBlock value: last value: index - 1.
- last := index.
- state := 1]]
- ifFalse: [
- char isUppercase ifTrue: [
- (state == 0)
- ifTrue: [state := 2] "start -> A"
- ifFalse: [
- (state < 2 or: [state > 3]) ifTrue: [
- "*A -> * | A (uppercase begins, flush before)"
- aBlock value: last value: index - 1.
- last := index.
- state := 2] ifFalse: [
- "AA -> AA (uppercase continues)"
- state := 3]]]
  ifFalse: [
+ char isUppercase
+ ifTrue: [ "A"
+ state = 0
+ ifTrue: [state := 2] "start -> A"
- ("char == $: or:" char isSeparator) ifTrue: [
- "skip colon/whitespace"
- (state > 0) ifTrue: [
- aBlock value: last value: index - 1.
- state := 0].
- last := index + 1]
- ifFalse: [
- char isDigit ifTrue: [
- (state == 0)
- ifTrue: [state := 4]
  ifFalse: [
+ (state < 2) | (state > 3) ifTrue: [
+ "*A -> * | A (uppercase begins, flush before)"
+ aBlock value: last value: index - 1.
+ last := index.
+ state := 2] ifFalse: [
+ "AA -> AA (uppercase continues)"
+ state := 3]]]
+ ifFalse: [
+ char isSeparator
+ ifTrue: [ " "
+ "skip whitespace"
+ state > 0 ifTrue: [
+ aBlock value: last value: index - 1.
+ state := 0].
+ last := index + 1]
+ ifFalse: [
+
+ char isDigit
+ ifTrue: [ "num"
+ state = 0
+ ifTrue: [state := 4]
+ ifFalse: [
+ state ~= 4 ifTrue: [
+ aBlock value: last value: index - 1.
+ last := index.
+ state := 4]]]
+ ifFalse: [ "op"
+ state = 0
+ ifTrue: [state := 5]
+ ifFalse: [
+ state < 5 ifTrue: [
+ aBlock value: last value: index - 1.
+ last := index.
+ state := 5]] ] ] ] ] ].
- (state ~= 4) ifTrue: [
- aBlock value: last value: index - 1.
- last := index.
- state := 4]]]
- ifFalse: [
- (state == 0)
- ifTrue: [state := 5]
- ifFalse: [
- (state < 5) ifTrue: [
- aBlock value: last value: index - 1.
- last := index.
- state := 5]]]]]]].
  last <= self size ifTrue: [
  aBlock value: last value: self size]!
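
A minimal workspace sketch, assuming Collections-mt.839 is loaded in the image, of how the start/stop indices can be turned back into substrings. The variable names are illustrative only; the expected tokens are the ones given under rule (3) of the method comment above.

```smalltalk
| source tokens |
source := 'MyModel55'.
tokens := OrderedCollection new.
"Copy out each feature using the index range handed to the block."
source findFeatureIndicesDo: [:start :stop |
	tokens addLast: (source copyFrom: start to: stop)].
tokens asArray "per the method comment: #('My' 'Model' '55')"
```

Working with indices rather than substrings lets a caller skip the copy for features it does not need, which is what #findFeaturesDo: does for non-letter tokens.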

Item was changed:
  ----- Method: String>>findFeaturesDo: (in category 'accessing - features') -----
  findFeaturesDo: aBlock
+ "Simple analysis of natural language in source code. Select all features that are letters only, normalize them as lowercase. No support for word stemming.
+
+ Example:
+ 'Transcript show: 123 asString; cr; show: #HelloWorld.'
+ -> #('transcript' 'show' 'as' 'string' 'cr' 'show' 'hello' 'world')
+ "
- "Simple analysis for natural language in source code. No support for word stemming."
 
  self findFeatureIndicesDo: [:start :end |
  (self at: start) isLetter ifTrue: [
  aBlock value: (self copyFrom: start to: end) asLowercase]].!
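
As a rough illustration of what the normalized features can be used for, the following workspace sketch counts word occurrences in a few identifiers. It assumes Collections-mt.839 is loaded; the input string and the variable name are made up for the example.

```smalltalk
| words |
words := Bag new.
"Features arrive lowercased, so 'File' and 'file' end up in the same bucket."
'openFile closeFile OpenURL' findFeaturesDo: [:each | words add: each].
words occurrencesOf: 'file' "2, once from openFile and once from closeFile"
```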

Item was added:
+ ----- Method: String>>findTokens (in category 'accessing - tokens') -----
+ findTokens
+
+ ^ self findTokens: Character separators!

Item was changed:
  ----- Method: String>>findTokens: (in category 'accessing') -----
  findTokens: delimiters
+ "Answer the collection of tokens between delimiters, which results from parsing self."
- "Answer the collection of tokens that result from parsing self."
 
+ | tokens |
+ tokens := OrderedCollection new.
+ self
+ findTokens: delimiters
+ do: [:token | tokens addLast: token].
+ ^ tokens!
- ^ OrderedCollection streamContents: [:tokens |
- self
- findTokens: delimiters
- do: [:token | tokens nextPut: token]]!

Item was changed:
+ ----- Method: String>>findTokens:do: (in category 'accessing - tokens') -----
- ----- Method: String>>findTokens:do: (in category 'accessing') -----
  findTokens: delimiters do: aBlock
 
  self
  findTokens: delimiters
  indicesDo: [:start :end | aBlock value: (self copyFrom: start to: end)].!

Item was changed:
+ ----- Method: String>>findTokens:indicesDo: (in category 'accessing - tokens') -----
+ findTokens: oneOrMoreCharacters indicesDo: aBlock
+ "Parse self to find tokens between delimiters. Any character in the Collection delimiters marks a border.  Several delimiters in a row are considered as just one separation. The interface is similar to #lineIndicesDo:."
- ----- Method: String>>findTokens:indicesDo: (in category 'accessing') -----
- findTokens: delimiters indicesDo: aBlock
- "Parse self to find tokens between delimiters. Any character in the Collection delimiters marks a border.  Several delimiters in a row are considered as just one separation.  Also, allow delimiters to be a single character. Similar to #lineIndicesDo:."
 
+ | keyStart keyStop separators size |
+ size := self size.
+ separators := oneOrMoreCharacters isCharacter
+ ifTrue: [{oneOrMoreCharacters}]
+ ifFalse: [oneOrMoreCharacters].
- | tokens keyStart keyStop separators |
- separators := delimiters isCharacter
- ifTrue: [Array with: delimiters]
- ifFalse: [delimiters].
  keyStop := 1.
+ [keyStop <= size] whileTrue: [
- [keyStop <= self size] whileTrue: [
  keyStart := self skipDelimiters: separators startingAt: keyStop.
  keyStop := self findDelimiters: separators startingAt: keyStart.
  keyStart < keyStop
  ifTrue: [aBlock value: keyStart value: keyStop - 1]].!
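
A short workspace sketch of the delimiter handling described in the method comment, assuming Collections-mt.839 is loaded. The sample strings and variable names are illustrative only.

```smalltalk
| line fields |
line := 'a, b,,c'.
fields := OrderedCollection new.
"Comma and space both act as delimiters; a run of delimiters is one separation."
line findTokens: ', ' indicesDo: [:start :stop |
	fields addLast: (line copyFrom: start to: stop)].
fields asArray. "expected: #('a' 'b' 'c')"

"A single Character is also accepted as the delimiter argument."
('key:value' findTokens: $:) asArray. "expected: #('key' 'value')"
```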