I read about this on a blog
(http://t-a-w.blogspot.com/2008/01/really-strange-quirk-of-ruby-and-perl.html
if you care) and remembered that I fixed this once in sed. Now, the same
for gst.

The behavior I implemented for tokenize is consistent with Ruby (I didn't
check Perl and Python); the behavior I implemented for gsub is consistent
with sed and Python but not with Ruby and Perl.

Paolo

2008-01-24  Paolo Bonzini  <[hidden email]>

        * kernel/Regex.st: Fix global substitution and tokenization
        for regexes that can match the empty string.

diff --git a/kernel/Regex.st b/kernel/Regex.st
index dec5e6e..b074361 100644
--- a/kernel/Regex.st
+++ b/kernel/Regex.st
@@ -881,10 +881,11 @@ String extend [
          of the match (as in #%)."
 
         <category: 'regex'>
-        | res idx regex beg end regs |
+        | res idx regex beg end regs emptyOk |
         regex := pattern asRegex.
         res := WriteStream on: (String new: to - from + 1).
         idx := from.
+        emptyOk := true.
 
         [regs := self
                     searchRegexInternal: regex
@@ -894,17 +895,20 @@ String extend [
                 whileFalse:
                     [beg := regs from.
                     end := regs to.
-                    res
-                        next: beg - idx
-                        putAll: self
-                        startingAt: idx.
-                    res nextPutAll: str % regs.
-                    idx := end + 1.
-                    beg > end
-                        ifTrue:
-                            [res nextPut: (self at: idx).
-                            idx := idx + 1].
-                    idx > self size ifTrue: [^res contents]].
+                    (beg <= end or: [ beg > idx or: [ emptyOk ]])
+                        ifTrue: [
+                            emptyOk := false.
+                            res
+                                next: beg - idx
+                                putAll: self
+                                startingAt: idx.
+                            res nextPutAll: str % regs.
+                            idx := end + 1]
+                        ifFalse: [
+                            beg <= to ifFalse: [^res contents].
+                            emptyOk := true.
+                            res nextPut: (self at: beg).
+                            idx := beg + 1]].
         res
             next: to - idx + 1
             putAll: self
@@ -963,11 +967,11 @@ String extend [
          are separated and stored into an Array of Strings that is returned."
 
         <category: 'regex'>
-        | res idx regex regs tokStart |
+        | res idx tokStart regex regs beg end emptyOk |
         regex := pattern asRegex.
         res := WriteStream on: (Array new: 10).
-        idx := from.
-        tokStart := 1.
+        idx := tokStart := from.
+        emptyOk := false.
 
         [regs := self
                     searchRegexInternal: regex
@@ -975,10 +979,27 @@ String extend [
                     to: to.
         regs notNil]
                 whileTrue:
-                    [res nextPut: (self copyFrom: tokStart to: regs from - 1).
-                    tokStart := regs to + 1.
-                    idx := regs to + 1 max: regs from + 1].
-        res nextPut: (self copyFrom: tokStart to: to).
+                    [beg := regs from.
+                    end := regs to.
+                    (beg <= end or: [ beg > idx or: [ emptyOk ]])
+                        ifTrue: [
+                            emptyOk := false.
+                            res nextPut: (self copyFrom: tokStart to: beg - 1).
+                            idx := tokStart := end + 1 ]
+                        ifFalse: [
+                            "If we reach the end of the string exit
+                             without adding the token.  tokStart must have been
+                             set above to TO + 1 (it was set above just before
+                             setting emptyOk to false), so we'd add an empty
+                             token we don't want."
+                            beg <= to ifFalse: [^res contents].
+                            emptyOk := true.
+
+                            "By not updating tokStart we put the character in the
+                             next token."
+                            idx := beg + 1]].
+        (tokStart <= to or: [ emptyOk ])
+            ifTrue: [ res nextPut: (self copyFrom: tokStart to: to) ].
         ^res contents
     ]
 
diff --git a/tests/strings.ok b/tests/strings.ok
index f083526..2706df5 100644
--- a/tests/strings.ok
+++ b/tests/strings.ok
@@ -66,3 +66,48 @@ returned value is ' - - '
 
 Execution begins...
 returned value is ''
+
+Execution begins...
+returned value is 'xaxbxcx'
+
+Execution begins...
+returned value is 'fx'
+
+Execution begins...
+returned value is 'fx'
+
+Execution begins...
+returned value is 'fx'
+
+Execution begins...
+returned value is 'xbx'
+
+Execution begins...
+returned value is 'xbx'
+
+Execution begins...
+returned value is 'xbx'
+
+Execution begins...
+returned value is 'xbxcx'
+
+Execution begins...
+returned value is 'xbxcx'
+
+Execution begins...
+returned value is '('abc' 'def' )'
+
+Execution begins...
+returned value is '('' 'abc' 'def' )'
+
+Execution begins...
+returned value is '('a' 'b' 'c' )'
+
+Execution begins...
+returned value is '('a' )'
+
+Execution begins...
+returned value is '('a' )'
+
+Execution begins...
+returned value is '('a' )'
diff --git a/tests/strings.st b/tests/strings.st
index be74137..ee2a16e 100644
--- a/tests/strings.st
+++ b/tests/strings.st
@@ -95,3 +95,20 @@ Eval [ '388350028456431097' formatAs: 'Card Number #### ###### #### Expires ##/#
 Eval [ '543' formatAs: '###-###-####' ]
 Eval [ '' formatAs: '###-###-####' ]
 Eval [ '1234' formatAs: '' ]
+
+"Have fun with regexes that can match the empty string."
+Eval [ 'abc' copyReplacingAllRegex: 'x*' with: 'x' ]    "xaxbxcx"
+Eval [ 'f' copyReplacingAllRegex: 'o*$' with: 'x' ]     "fx"
+Eval [ 'fo' copyReplacingAllRegex: 'o*$' with: 'x' ]    "fx"
+Eval [ 'foo' copyReplacingAllRegex: 'o*$' with: 'x' ]   "fx"
+Eval [ 'ba' copyReplacingAllRegex: 'a*' with: 'x' ]     "xbx"
+Eval [ 'baa' copyReplacingAllRegex: 'a*' with: 'x' ]    "xbx"
+Eval [ 'baaa' copyReplacingAllRegex: 'a*' with: 'x' ]   "xbx"
+Eval [ 'bc' copyReplacingAllRegex: 'a*' with: 'x' ]     "xbxcx"
+Eval [ 'bac' copyReplacingAllRegex: 'a*' with: 'x' ]    "xbxcx"
+Eval [ ('abc def ' tokenize: ' ') printString ]         "(abc def)"
+Eval [ (' abc def ' tokenize: ' ') printString ]        "('' abc def)"
+Eval [ ('abc' tokenize: 'x*') printString ]             "(a b c)"
+Eval [ ('axxx' tokenize: 'x*') printString ]            "(a)"
+Eval [ ('ax' tokenize: 'x*') printString ]              "(a)"
+Eval [ ('a' tokenize: 'x*') printString ]               "(a)"

_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk
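For readers who don't have gst at hand, here is a rough Python model of the two loops in the patch. It is only a sketch: the names gsub_model and tokenize_model are made up for illustration, indices are 0-based with an exclusive end (the Smalltalk code is 1-based and inclusive), the whole string is processed rather than a from/to range, and the replacement is a literal string, so the %n substitution done by "str % regs" is not modelled. The loop is written out by hand instead of calling re.sub because Python's own re.sub has, since 3.7, also replaced empty matches adjacent to a previous non-empty match, which differs from the behavior the patch implements.

```python
import re


def gsub_model(text, pattern, replacement):
    """Model of the patched substitution loop (literal replacement only)."""
    regex = re.compile(pattern)
    out = []
    idx = 0            # next unprocessed position
    empty_ok = True    # may an empty match sitting exactly at idx be used?
    while True:
        m = regex.search(text, idx)
        if m is None:
            break
        beg, end = m.span()
        if beg < end or beg > idx or empty_ok:
            # Non-empty match, or an empty one that does not directly
            # follow the previous match: replace it.
            empty_ok = False
            out.append(text[idx:beg])
            out.append(replacement)
            idx = end
        else:
            # Empty match glued to the previous match: skip it and copy
            # one character instead, unless we are at the end.
            if beg >= len(text):
                return "".join(out)
            empty_ok = True
            out.append(text[beg])
            idx = beg + 1
    out.append(text[idx:])
    return "".join(out)


def tokenize_model(text, pattern):
    """Model of the patched tokenize loop."""
    regex = re.compile(pattern)
    tokens = []
    idx = tok_start = 0
    empty_ok = False   # an empty separator at the very start is ignored
    while True:
        m = regex.search(text, idx)
        if m is None:
            break
        beg, end = m.span()
        if beg < end or beg > idx or empty_ok:
            empty_ok = False
            tokens.append(text[tok_start:beg])
            idx = tok_start = end
        else:
            # At the end of the string, stop without adding an empty token.
            if beg >= len(text):
                return tokens
            empty_ok = True
            # tok_start is left alone, so the skipped character ends up
            # in the next token.
            idx = beg + 1
    if tok_start < len(text) or empty_ok:
        tokens.append(text[tok_start:])
    return tokens


if __name__ == "__main__":
    print(gsub_model("abc", "x*", "x"))
    print(tokenize_model("abc", "x*"))
```

The emptyOk flag is the whole trick: an empty match is accepted unless it sits exactly where the previous accepted match ended, in which case one character is consumed first. For substitution it starts out true, so an empty match at the very beginning is replaced; for tokenization it starts out false, so an empty separator at position 0 does not produce a leading empty token.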