[PATCH] Fix regexes that can match the empty string

Paolo Bonzini-2
I read about this on a blog
(http://t-a-w.blogspot.com/2008/01/really-strange-quirk-of-ruby-and-perl.html
if you care) and remembered that I once fixed the same problem in sed.
This patch does the same for gst.  The behavior I implemented for tokenize
is consistent with Ruby (I didn't check Perl and Python); the behavior I
implemented for gsub is consistent with sed and Python, but not with Ruby
and Perl.
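
For readers who would rather not trace the diff, here is a rough Python model of the two patched loops. It is only a sketch: the names gsub, tokenize and empty_ok are illustrative (empty_ok mirrors the emptyOk flag in the patch), indices are 0-based instead of Smalltalk's 1-based, and the replacement is a plain string rather than a % template. It deliberately avoids re.sub and re.split, whose handling of empty matches depends on the Python version.

```python
import re


def gsub(text, pattern, repl):
    """Model of the patched global substitution loop (names are hypothetical)."""
    regex = re.compile(pattern)
    out = []
    pos = 0
    empty_ok = True  # an empty match at the very start is accepted
    while True:
        m = regex.search(text, pos)
        if m is None:
            break
        beg, end = m.span()
        if beg < end or beg > pos or empty_ok:
            # Non-empty match, or an empty match we are allowed to take.
            empty_ok = False
            out.append(text[pos:beg])
            out.append(repl)
            pos = end
        else:
            # Empty match right after the previous match: reject it,
            # copy one character and allow an empty match again.
            if beg >= len(text):
                return "".join(out)
            empty_ok = True
            out.append(text[beg])
            pos = beg + 1
    out.append(text[pos:])
    return "".join(out)


def tokenize(text, pattern):
    """Model of the patched tokenization loop (names are hypothetical)."""
    regex = re.compile(pattern)
    tokens = []
    pos = tok_start = 0
    empty_ok = False  # an empty separator at the very start is rejected
    while True:
        m = regex.search(text, pos)
        if m is None:
            break
        beg, end = m.span()
        if beg < end or beg > pos or empty_ok:
            empty_ok = False
            tokens.append(text[tok_start:beg])
            pos = tok_start = end
        else:
            if beg >= len(text):
                return tokens
            empty_ok = True
            # tok_start is left alone, so the skipped character
            # ends up in the next token.
            pos = beg + 1
    if tok_start < len(text) or empty_ok:
        tokens.append(text[tok_start:])
    return tokens


print(gsub("abc", "x*", "x"))    # xaxbxcx
print(gsub("baaa", "a*", "x"))   # xbx
print(tokenize("abc", "x*"))     # ['a', 'b', 'c']
```

The printed values match the corresponding entries added to tests/strings.ok below.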

Paolo

2008-01-24  Paolo Bonzini  <[hidden email]>

        * kernel/Regex.st: Fix global substitution and tokenization for
        regexes that can match the empty string.

 
diff --git a/kernel/Regex.st b/kernel/Regex.st
index dec5e6e..b074361 100644
--- a/kernel/Regex.st
+++ b/kernel/Regex.st
@@ -881,10 +881,11 @@ String extend [
  of the match (as in #%)."
 
  <category: 'regex'>
- | res idx regex beg end regs |
+ | res idx regex beg end regs emptyOk |
  regex := pattern asRegex.
  res := WriteStream on: (String new: to - from + 1).
  idx := from.
+ emptyOk := true.
 
  [regs := self
     searchRegexInternal: regex
@@ -894,17 +895,20 @@ String extend [
  whileFalse:
     [beg := regs from.
     end := regs to.
-    res
- next: beg - idx
- putAll: self
- startingAt: idx.
-    res nextPutAll: str % regs.
-    idx := end + 1.
-    beg > end
- ifTrue:
-    [res nextPut: (self at: idx).
-    idx := idx + 1].
-    idx > self size ifTrue: [^res contents]].
+    (beg <= end or: [ beg > idx or: [ emptyOk ]])
+ ifTrue: [
+    emptyOk := false.
+    res
+        next: beg - idx
+        putAll: self
+        startingAt: idx.
+            res nextPutAll: str % regs.
+    idx := end + 1]
+ ifFalse: [
+    beg <= to ifFalse: [^res contents].
+    emptyOk := true.
+    res nextPut: (self at: beg).
+    idx := beg + 1]].
  res
     next: to - idx + 1
     putAll: self
@@ -963,11 +967,11 @@ String extend [
  are separated and stored into an Array of Strings that is returned."
 
  <category: 'regex'>
- | res idx regex regs tokStart |
+ | res idx tokStart regex regs beg end emptyOk |
  regex := pattern asRegex.
  res := WriteStream on: (Array new: 10).
- idx := from.
- tokStart := 1.
+ idx := tokStart := from.
+ emptyOk := false.
 
  [regs := self
     searchRegexInternal: regex
@@ -975,10 +979,27 @@ String extend [
     to: to.
  regs notNil]
  whileTrue:
-    [res nextPut: (self copyFrom: tokStart to: regs from - 1).
-    tokStart := regs to + 1.
-    idx := regs to + 1 max: regs from + 1].
- res nextPut: (self copyFrom: tokStart to: to).
+    [beg := regs from.
+    end := regs to.
+    (beg <= end or: [ beg > idx or: [ emptyOk ]])
+ ifTrue: [
+    emptyOk := false.
+    res nextPut: (self copyFrom: tokStart to: beg - 1).
+    idx := tokStart := end + 1 ]
+ ifFalse: [
+    "If we reach the end of the string exit
+     without adding the token.  tokStart must have been
+     set above to TO + 1 (it was set above just before
+     setting emptyOk to false), so we'd add an empty
+     token we don't want."
+    beg <= to ifFalse: [^res contents].
+    emptyOk := true.
+
+    "By not updating tokStart we put the character in the
+     next token."
+    idx := beg + 1]].
+ (tokStart <= to or: [ emptyOk ])
+    ifTrue: [ res nextPut: (self copyFrom: tokStart to: to) ].
  ^res contents
     ]
 
diff --git a/tests/strings.ok b/tests/strings.ok
index f083526..2706df5 100644
--- a/tests/strings.ok
+++ b/tests/strings.ok
@@ -66,3 +66,48 @@ returned value is '   -   -    '
 
 Execution begins...
 returned value is ''
+
+Execution begins...
+returned value is 'xaxbxcx'
+
+Execution begins...
+returned value is 'fx'
+
+Execution begins...
+returned value is 'fx'
+
+Execution begins...
+returned value is 'fx'
+
+Execution begins...
+returned value is 'xbx'
+
+Execution begins...
+returned value is 'xbx'
+
+Execution begins...
+returned value is 'xbx'
+
+Execution begins...
+returned value is 'xbxcx'
+
+Execution begins...
+returned value is 'xbxcx'
+
+Execution begins...
+returned value is '('abc' 'def' )'
+
+Execution begins...
+returned value is '('' 'abc' 'def' )'
+
+Execution begins...
+returned value is '('a' 'b' 'c' )'
+
+Execution begins...
+returned value is '('a' )'
+
+Execution begins...
+returned value is '('a' )'
+
+Execution begins...
+returned value is '('a' )'
diff --git a/tests/strings.st b/tests/strings.st
index be74137..ee2a16e 100644
--- a/tests/strings.st
+++ b/tests/strings.st
@@ -95,3 +95,20 @@ Eval [ '388350028456431097' formatAs: 'Card Number #### ###### #### Expires ##/#
 Eval [ '543' formatAs: '###-###-####' ]
 Eval [ '' formatAs: '###-###-####' ]
 Eval [ '1234' formatAs: '' ]
+
+"Have fun with regexes that can match the empty string."
+Eval [ 'abc' copyReplacingAllRegex: 'x*' with: 'x' ] "xaxbxcx"
+Eval [ 'f' copyReplacingAllRegex: 'o*$' with: 'x' ] "fx"
+Eval [ 'fo' copyReplacingAllRegex: 'o*$' with: 'x' ] "fx"
+Eval [ 'foo' copyReplacingAllRegex: 'o*$' with: 'x' ] "fx"
+Eval [ 'ba' copyReplacingAllRegex: 'a*' with: 'x' ] "xbx"
+Eval [ 'baa' copyReplacingAllRegex: 'a*' with: 'x' ] "xbx"
+Eval [ 'baaa' copyReplacingAllRegex: 'a*' with: 'x' ] "xbx"
+Eval [ 'bc' copyReplacingAllRegex: 'a*' with: 'x' ] "xbxcx"
+Eval [ 'bac' copyReplacingAllRegex: 'a*' with: 'x' ] "xbxcx"
+Eval [ ('abc def ' tokenize: ' ') printString ] "(abc def)"
+Eval [ (' abc def ' tokenize: ' ') printString ] "('' abc def)"
+Eval [ ('abc' tokenize: 'x*') printString ] "(a b c)"
+Eval [ ('axxx' tokenize: 'x*') printString ] "(a)"
+Eval [ ('ax' tokenize: 'x*') printString ] "(a)"
+Eval [ ('a' tokenize: 'x*') printString ] "(a)"
