Regular Expressions are not limited to Strings

Christoph Thiede

Hi all,


Just a small goody for anyone interested: it turns out that, thanks to the pervasive polymorphism in Squeak, regular expressions (as implemented in the Regex package in Trunk, originally developed by Vassili Bykov) are not limited to collections that are actually strings. Here is a short counter-example:


regex := RxParser new parse: #(1 2 $+ 1).

matcher := RxParser preferredMatcherClass for: regex.

matcher matches: #(1 2 2 2 1). "true!"


To make the example work, only a small number of hard-coded class names have to be adjusted; see the attached changeset, which is really tiny.
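
To see how the quantifier behaves on non-string elements, it may help to try a few more inputs against the same matcher. This is only a sketch: it assumes the attached changeset is loaded, and the comments describe what the pattern requires rather than verified output.

```smalltalk
"Sketch only; assumes the changeset attached to this post is loaded."
regex := RxParser new parse: #(1 2 $+ 1).
matcher := RxParser preferredMatcherClass for: regex.

"The integers 1 and 2 are literal elements; the character $+ is read as the usual quantifier."
matcher matches: #(1 2 1).        "a single 2 is enough for 2+"
matcher matches: #(1 1).          "2+ needs at least one 2, so no match is expected"
matcher matches: #(1 2 2 1 0).    "matches: must consume the whole collection, so the trailing 0 should prevent a match"
```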


Here's another example:


matcher copy: #(1 2 2 1 0 1 2 1) translatingMatchesUsing: [:match | match negated]. "#(-1 -2 -2 -1 0 -1 -2 -1)"


This also allows us to style texts using regexes:


matcher := 'ab+a' asRegex.

matcher copy: ' aa-aba-abba ' asText translatingMatchesUsing: [:match | match allBold]. "' aa-aba-abba ', with aba and abba set in bold"
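
To check which substrings the block receives, the matches can also be listed directly. The sketch below uses only the standard string protocol of the Regex package (matchesIn:), so it should not depend on the changeset.

```smalltalk
matcher := 'ab+a' asRegex.
"'aa' contains no b and is skipped; 'aba' and 'abba' are the two matches."
matcher matchesIn: ' aa-aba-abba '.   "an OrderedCollection with 'aba' and 'abba'"
```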


However, to preserve the original text attributes, we would need to hack TextStream >> #withAttributes:do: into the copy methods, analogously to Text >> #format:. I guess this limitation could only be resolved properly by redesigning Text as a collection of TextCharacters, which might be very slow.


Nevertheless, I think this insight opens up great possibilities for other forms of parsing. Maybe one could also process binary streams using polymorphic regex patterns, or even process sequences of domain-specific objects. Because RxsPredicate is so generic, you could also simply define custom predicates for these objects. The next step could then be adding support for nested collections (RxsNested?) so that you could parse entire trees of objects ... Ah, such beautiful dreams :-)


Best,

Christoph




Attachment: regex-polymorphy.2.cs (5K)
Carpe Squeak!
Re: Regular Expressions are not limited to Strings

Christoph Thiede

Woohoo, nested regular expressions are even easier than I thought!


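"Inner pattern: a 3, any number of 4s, then a 5."
innerRegex := RxParser new parse: #(3 4 $* 5).
innerMatcher := RxParser preferredMatcherClass for: innerRegex.
"Outer pattern: #(1 2), then up to three elements accepted by innerMatcher (the characters ${ $, $3 $} spell the quantifier {,3}), then #(6 7)."
regex := RxParser new parse: {#(1 2). innerMatcher. ${. $,. $3. $}. #(6 7)}.
matcher := RxParser preferredMatcherClass for: regex.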

matcher matches: #((1 2) (3 4 5) (6 7)). "true"
matcher matches: #((1 2) (3 4 5) (3 5) (6 7)). "true"
matcher matches: #((1 2) (3 4 5) (3 5) (3 4 4 5) (6 7)). "true"
matcher matches: #((1 2) (3 4 5) (3 5) (3 4 4 5) (3 5) (6 7)). "false"
matcher matches: #((1 2) (3 4 5) (3 2 5) (3 4 4 5) (6 7)). "false"

I'll share my changeset upon request. This is really exciting stuff.


Best,
Christoph


From: Squeak-dev <[hidden email]> on behalf of Thiede, Christoph
Sent: Wednesday, 7 April 2021, 19:37
To: Squeak Dev
Subject: [squeak-dev] Regular Expressions are not limited to Strings
 

Carpe Squeak!