[poll] regex literals

Paolo Bonzini-2

[poll] regex literals

I'm thinking of adding regex literals to GNU Smalltalk. The only syntax
I found that would work is ##/regex/. /regex/ wouldn't work for the old
syntax, because the lexer has no way to understand that the / in this
example

a: b
/regex/ printNl

starts a regex and is not a division operator. It would work in the new
syntax (after one of [ ( { ^ . keyword: identifier binary-message, and
maybe a few more I forgot, / would start a regex, otherwise it would be
a division operator), but I don't like to add a feature that cannot be
ported to other Smalltalks.

What do you think? Right now I'm more for "no" or "not yet", but I'm
open to discussion.

Paolo

_______________________________________________
help-smalltalk mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/help-smalltalk

Stefan Schmiedl

Re: [poll] regex literals

On Wed, 13 Feb 2008 09:58:40 +0100
Paolo Bonzini <[hidden email]> wrote:

> I'm thinking of adding regex literals to GNU Smalltalk. The only syntax
> I found that would work is ##/regex/.
> ...
> What do you think? Right now I'm more for "no" or "not yet", but I'm
> open to discussion.

One thing I've seen with locale describing symbols in VisualWorks is
#"de_de.UTF-8", so going along with this approach something like
#/.../ makes sense.

s.

Paolo Bonzini-2

Re: [poll] regex literals

>> I'm thinking of adding regex literals to GNU Smalltalk. The only syntax
>> I found that would work is ##/regex/.
>> ...
>> What do you think? Right now I'm more for "no" or "not yet", but I'm
>> open to discussion.
>
> One thing I've seen with locale describing symbols in VisualWorks is
> #"de_de.UTF-8"

Yes, that's #'de_de.UTF-8'. It's supported in GNU Smalltalk too, for
"weird" symbols that are not valid Smalltalk message names.

> , so going along with this approach something like
> #/.../ makes sense.

Two hashes because #/ is valid Smalltalk.

Paolo

Stefan Schmiedl

Re: [poll] regex literals

On Wed, 13 Feb 2008 10:38:35 +0100
Paolo Bonzini <[hidden email]> wrote:

> > One thing I've seen with locale describing symbols in VisualWorks is
> > #"de_de.UTF-8"
>
> Yes, that's #'de_de.UTF-8'.

Obviously, I need Smalltalk syntax coloring for my mail client :-)

s.

Paolo Bonzini-2

Re: [poll] regex literals

In reply to this post by Paolo Bonzini-2

Tony Garnock-Jones wrote:
> Paolo Bonzini wrote:
>> I'm thinking of adding regex literals to GNU Smalltalk.
>
> I'd be against this.
>
> 'a.*b' asRegex
>
> to me seems better, and doesn't require any lexer/parser changes.

It's also slower, which is why as of today 'a.*b' works even without
sending #asRegex.
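
As a rough illustration of the two spellings being compared, the snippet below matches the same text against a plain string pattern and against one converted with #asRegex. It assumes the String regex operators #~ (answers a Boolean) and #=~ (answers a match object) from the GNU Smalltalk kernel; the sample text is made up.

```smalltalk
| text |
text := 'xaxxb'.

"Pattern given as a plain string literal."
(text ~ 'a.*b') printNl.

"Pattern converted explicitly with #asRegex before the match."
(text ~ 'a.*b' asRegex) printNl.

"#=~ answers a match object rather than a Boolean; #match is the matched text."
(text =~ 'a.*b') match printNl.
```

Both forms give the same answer here; the difference under discussion is only the cost of the extra send and how the argument is interpreted.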

However, *always* treating string literals as regexes is going to give
problems in the long term. In particular, it would break with another
extension that I was thinking about:

#(1 3 2 6 5 4) select: #odd => #(1 3 5)
#(1 12 2) select: (1 to: 10) => #(1 2)
#('foo' 'bar') select: ##/f./ => #('foo')

This would be quite easily implemented (#select: would send a new
message to its argument, e.g. #~, instead of #value:). If regexes were
implemented simply as strings, however, there would be a conflict
between the Collection example (second) and the regex example (third):

'foo' select: 'aeiouy' => 'oo'
#('foo') select: 'f.' => cannot make it return 'foo' as I'd like!

That's why in this case, simply using string literals as regexes
wouldn't work. You would need to specify #asRegex to get the desired
behavior.

As I said, I'm also thinking "no"/"not yet". It's not paramount: older
code would be unaffected, and I could start implementing the above
(which is not happening any time soon), and then see if it is a problem.
Just, there *might* be one.

Paolo

Tony Garnock-Jones-2

Re: [poll] regex literals

Paolo Bonzini wrote:
> It's also slower, which is why as of today 'a.*b' works even without
> sending #asRegex.

Slower because of repeated sends of asRegex?

I'd rather see new syntax for compile-time evaluation in literal
position, instead of specialised syntax for regex literals.

##('a.*b' asRegex)

> However, *always* treating string literals as regexes is going to give
> problems in the long term.

Agreed. Type-punning is often not a great idea.

Regards,
Tony

Paolo Bonzini-2

Re: [poll] regex literals

>> It's also slower, which is why as of today 'a.*b' works even without
>> sending #asRegex.
>
> Slower because of repeated sends of asRegex?

Yes. Or just because 1 send is already more than 0!

> I'd rather see new syntax for compile-time evaluation in literal
> position, instead of specialised syntax for regex literals.
>
> ##('a.*b' asRegex)

A bit verbose but yes, it is a possibility if performance is a concern.
And it works now.
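
A minimal sketch of how the compile-time form might be used inside a method, assuming the ##( ) compile-time constant syntax mentioned above. The selector #containsAThenB is a made-up helper for illustration, not part of GNU Smalltalk.

```smalltalk
String extend [
    containsAThenB [
        "Hypothetical helper. The ##( ) expression is evaluated when the
         method is compiled, so #asRegex is not sent on every call."
        ^self ~ ##('a.*b' asRegex)
    ]
]

'xaxxb' containsAThenB printNl.
'xyz' containsAThenB printNl.
```

The verbosity is confined to the one place where the pattern is written, and no lexer change is needed.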

Paolo

Tony Garnock-Jones-2

Re: [poll] regex literals

Paolo Bonzini wrote:
> A bit verbose but yes, it is a possibility if performance is a concern.
> And it works now.

Sorry? There's existing compile-time-eval syntax? Cool!

Tony

S11001001

Re: [poll] regex literals

In reply to this post by Paolo Bonzini-2

Paolo Bonzini <[hidden email]> writes:
> However, *always* treating string literals as regexes is going to give
> problems in the long term. In particular, it would break with another
> extension that I was thinking about:
>
> #(1 3 2 6 5 4) select: #odd => #(1 3 5)

This is sort of in the Presource test suite:

#(1 3 2 6 5 4) select: #odd sendingBlock
-| #(1 3 2 6 5 4) select: [:gensym | gensym odd]
=> #(1 3 5)

> #(1 12 2) select: (1 to: 10) => #(1 2)

I would not use that :)

> #('foo' 'bar') select: ##/f./ => #('foo')
>
> This would be quite easily implemented (#select: would send a new
> message to its argument, e.g. #~, instead of #value:).

I would rather have a generalization of the sendingBlock protocol to
send explicitly, perhaps with this extension (because with literals,
there's no chance for confusion):

Eval [NoCandy.MyCodeMindset installIn: Namespace current]

NoCandy.Presrc.MessageMacro subclass: SelectLiteralBlocks [
    <pool: NoCandy.Presrc> "eh?"

    "obviously you would memoize this result"
    SelectLiteralBlocks class >> inlinableActions [
        "since, for all these cases, the standard #select: semantics
         *obviously* aren't useful"
        ^{'`@x to: `@y' -> '[:`g1 | `g1 between: `@x and: `@y]'
              -> [:m |
                  ((m atAll: #('`@x' '`@y')) allSatisfy: [:each |
                      each isLiteral and: [each value isInteger]])
                      ifTrue: [{'`g1' -> self newVariable}]].
          '`@x' -> '[:`g1 | `g1 `sel]'
              -> [:m | | sel |
                  sel := m at: '`@x'.
                  {sel isLiteral.
                   sel value isSymbol.
                   sel value numArgs = 0}
                      condEvery ifTrue: [{'`g1' -> self newVariable.
                                          #'`sel' -> sel value}]].
          '`@x' -> '[:`g1 | `g1 ~ `@x]'
              -> [:m | | x |
                  x := m at: '`@x'.
                  (x isLiteral and: [x value isRegex])
                      ifTrue: [{'`g1' -> self newVariable}]].
         } collect: [:triplet |
             {CodeTemplate fromExpr: triplet key key.
              CodeTemplate fromExpr: triplet key value.
              triplet value}]
    ]

    expandMessage: sel to: rcv withArguments: args [
        | filter |
        filter := args first.
        self class inlinableActions do: [:triplet | | match expand test |
            match := triplet first. expand := triplet second.
            test := triplet third.
            (match match: filter) ifNotNil: [:pm |
                (test value: pm) ifNotNil: [:xtn |
                    xtn do: [:each | pm add: each].
                    ^STInST.RBMessageNode
                        receiver: rcv
                        selector: sel
                        arguments: {expand expand: pm}]]].
        ^self forgoExpansion
    ]
]

#(1 12 2) select: (1 to: 10)
-| #(1 12 2) select: [:gensym | gensym between: 1 and: 10]
=> #(1 2)

On a side note, with Unicode, #∋ would be a good name for #~, or maybe
#includes: :)

--
But you know how reluctant paranormal phenomena are to reveal
themselves when skeptics are present. --Robert Sheaffer, SkI 9/2003

Paolo Bonzini-2

Re: [poll] regex literals

> This is sort of in the Presource test suite:
>
> #(1 3 2 6 5 4) select: #odd sendingBlock
> -| #(1 3 2 6 5 4) select: [:gensym | gensym odd]
> => #(1 3 5)
>
>> #(1 12 2) select: (1 to: 10) => #(1 2)
>
> I would not use that :)

Note that it's just a special case of Collections:

'foobar' select: 'aeiou' => 'ooa'

In fact, "#(1 1.2 2) select: (1 to: 10)" would *not* include 1.2 in the
result.

My desire is to allow the common idea of "select: #odd" without
implementing Symbol>>#value:. I see no need to implement #sendingBlock
(all this IMHO of course) if you reason that:

1) right now, #select: and #collect: have the same "protocol" for the
argument, but the two are very different. In the case of
#select:/#reject: the argument should return true/false for any
collection; for #collect: instead the argument should return an object
in the same domain as the source.

Taking an extreme position: #value: is the most overloaded method in
Smalltalk and the less you use it, the better. :-) (Because then you
can achieve more polymorphism and more DWIM).

2) therefore, I decide that #select: (and #reject:) accept a different
thing than a block, a "predicate". A predicate can be a unary block of
course, but also a symbol, a regex, a collection, ... I chose #~ as the
message that the predicate protocol would implement because it's what we
use for regexes, but it's not necessary to implement it with that name
(also because we currently have "aString ~ aRegex", not the other way
round).

3) the same could apply to #collect:, but with a *different* message to
emphasize that the argument is not a "predicate", it is an "xyz" (name
to be decided :-) I didn't find any good one). I don't have very strong
ideas on how to call the message, but it also could apply to symbols,
regexes and collections: for example

#('1.2' '3.4') collect: #allButLast => #('1.' '3.')

#('1.2' '3.4') collect: '^.*\.' asRegex => #('1.' '3.')
#('1.2' '3.4') collect: '\.(.*)' asRegex => #('2' '4')

#('foo' 'bar') collect: #(1 3) => #('fo' 'br')
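
The "predicate" idea in points 1) and 2) could be sketched roughly as follows. This is only an illustration under stated assumptions: #accepts: and #selectMatching: are hypothetical names chosen so that the existing #select: and String>>#~ are left untouched, whereas the proposal above would change #select: itself and use #~ or a similar message.

```smalltalk
"Each kind of predicate decides for itself how to test an element."
BlockClosure extend [
    accepts: anObject [ ^self value: anObject ]
]

Collection extend [
    "A collection accepts the objects it contains."
    accepts: anObject [ ^self includes: anObject ]

    "Hypothetical stand-in for a predicate-aware #select:."
    selectMatching: aPredicate [
        ^self select: [:each | aPredicate accepts: each]
    ]
]

Symbol extend [
    "A symbol is sent to the element as a unary message."
    accepts: anObject [ ^anObject perform: self ]
]

Regex extend [
    "A regex accepts the strings it matches."
    accepts: anObject [ ^anObject ~ self ]
]

(#(1 3 2 6 5 4) selectMatching: #odd) printNl.
(#(1 12 2) selectMatching: (1 to: 10)) printNl.
(#('foo' 'bar') selectMatching: 'f.' asRegex) printNl.
('foo' selectMatching: 'aeiouy') printNl.
```

Because a plain String is just a Collection here, 'aeiouy' selects vowels while 'f.' asRegex selects matching strings, which is the distinction the thread says would be lost if string literals were always treated as regexes.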

> NoCandy.Presrc.MessageMacro subclass: SelectLiteralBlocks [
> <pool: NoCandy.Presrc> "eh?"

You mean <import: ...> here?

> On a side note, with Unicode, #∋ would be a good name for #~, or maybe
> #includes: :)

Now what Unicode symbols would be binary messages, and which would be
okay for identifiers/keywords? :-)

Paolo
