[vwnc] Fwd: Regex11

Steffen Märcker

[vwnc] Fwd: Regex11

Today I wrote to Vassili Bykov about a little bug in Regex11 which
could be interesting for you too. Any opinions?

------- Forwarded message -------
To: "Vassili Bykov" <[hidden email]>
Cc:
Subject: Regex11
Date: Mon, 15 Sep 2008 09:33:13 +0200

Dear Vassili Bykov,

as far as I can see you've written the Regex11 package and I think I've
found a bug. The documentation states:

\w any word constituent character (same as [a-zA-Z0-9_])

but the actual implementation fails with underscores:

'_' matchesRegex: '\w' "evaluates to false"
'_' matchesRegex: '[a-zA-Z0-9_]' "evaluates to true"

Should the bug be fixed, since we rely on the documented definition of \w, or
would it be better to adjust the documentation so as not to break existing
applications? How can I help?

Greetings,
Steffen
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

Cesar Rabak

Re: [vwnc] Fwd: Regex11

Steffen,

I don't understand your "bug" report, because the documentation states
"...same as [a-zA-Z0-9*_*]" (the asterisks are for emphasis).

Isn't the documentation stating what the code does?

Or are your expectations based on *other* definitions of \w?

--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/

Steffen Märcker

Re: [vwnc] Fwd: Regex11

In reply to this post by Steffen Märcker

Hi,

> Isn't the documentation stating what the code does?

Exactly: the documentation states that '_' should match '\w', but the code does not match it. Just evaluate the code examples I gave:

>> '_' matchesRegex: '\w' "evaluates to false"
>> '_' matchesRegex: '[a-zA-Z0-9_]' "evaluates to true"

As far as I understand, both should evaluate to true. Or have I missed something?

Regards,
Steffen

skrish

Re: [vwnc] Fwd: Regex11

What Steffen states is obviously right.

From SmaCC's Scanner.html:

\w	Matches any letter, number or underscore, [A-Za-z0-9_].

'_', the underscore, should be matched.

regards

skrish


Steven Kelly

Re: [vwnc] Fwd: Regex11

In reply to this post by Steffen Märcker

Regex libraries differ in whether \w includes the underscore. The current implementation of Regex11 does not have a bug per se: the code clearly says “char isAlphaNumeric”. Rather, the documentation is incorrect.
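
The difference is visible at the Character level. A minimal workspace sketch, assuming (as the code quoted above suggests) that the current matcher tests \w with #isAlphaNumeric:

```smalltalk
"Workspace expressions; evaluate each line with 'Print it'."
$a isAlphaNumeric.    "true"
$7 isAlphaNumeric.    "true"
$_ isAlphaNumeric.    "false: the underscore is neither a letter nor a digit,
                       which is why '_' matchesRegex: '\w' answers false"
```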

However, the move in regex libraries in general seems to be towards including underscore:

http://jakarta.apache.org/regexp/changes.html (added in 1.3, 2003)

http://nlp.stanford.edu/nlp/javadoc/gnu-regexp-docs/changes.html (added in 1.1, 2001)

http://savannah.gnu.org/bugs/?19637#discussion (GNU documentation “\w = [[:alnum:]]” corrected 2007)

Do people feel that adding the underscore to \w in Regex11 would be more of a breaking change, or more of a bug fix?

Steve


Steffen Märcker

Re: [vwnc] Fwd: Regex11

In my opinion it's clearly a bug fix, and consistency with common
implementations is welcome, of course.

+1 for the fix


Vassili Bykov-2

Re: [vwnc] Fwd: Regex11

I think \w should include underscore to match what seems to be the
expected behavior out there. Going a step further, perhaps a Character
predicate named something like #isIdentifierConstituent to include $_
in addition to pure alphanumerics would be useful.

Cheers,

--Vassili

Steven Kelly

Re: [vwnc] Fwd: Regex11

In reply to this post by Steffen Märcker

OK! I've updated the code, tests and documentation and published as 1.3.

I second the vote for a Character predicate for this in the base - it would replace longer code in 7 places in a current standard image. I think the name should be #isWordConstituent, since "identifier" generally implies the first character is not numeric.

Steve


Cesar Rabak

Re: [vwnc] Fwd: Regex11

Steven Kelly wrote:
> OK! I've updated the code, tests and documentation and published as
> 1.3.
>
Great!
> I second the vote for a Character predicate for this in the base - it

Please add one to this vote!

> would replace longer code in 7 places in a current standard image. I
> think the name should be #isWordConstituent, since "identifier"
> generally implies the first character is not numeric.

Well, thinking about it, I _believe_ #isWordConstituent is as ambiguous
as it gets, since we do not use digits in plain (English or other
natural-language) words either, right?

Now, if we stick with Vassili's suggestion, note that '_', '[0-9]' and
'[A-Za-z]' characters are all constituents of identifiers!

So for Character there is nothing wrong with naming the predicate that way.

my .019999...

--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/

Vassili Bykov-2

Re: [vwnc] Fwd: Regex11

In reply to this post by Steven Kelly

On Wed, Sep 17, 2008 at 4:24 AM, Steven Kelly <[hidden email]> wrote:
> OK! I've updated the code, tests and documentation and published as 1.3.

Thank you!

> I second the vote for a Character predicate for this in the base - it would replace longer code in 7 places in a current standard image. I think the name should be #isWordConstituent, since "identifier" generally implies the first character is not numeric.

#isWordConstituent was my first thought too, but "word" may suggest
other interpretations. For example, a locale-specific test for
characters acceptable in words of the current locale's language.

Cheers,

--Vassili

Reinout Heeck

Re: [vwnc] Fwd: Regex11

In reply to this post by Cesar Rabak

Cesar Rabak wrote:
> Steven Kelly wrote:
>> OK! I've updated the code, tests and documentation and published as
>> 1.3.
>>
> Great!
>> I second the vote for a Character predicate for this in the base - it
>
> Please add one to this vote!

+1

> Well, thinking about it I _believe_ #isWordConstituent is as ambiguous
> as it gets since we do not use numbers for plain (English or natural
> language) words as well, right?
>
> Now if we stick to Vassili's suggestion, note that a character '_' or
> '[0-9]' or '[A-Za-z]' are constituents of identifiers!

Hmm, maybe the selector should express which 'standard' we are following
for this class of characters.
Something like: #isRegexWordConstituent or #matchesRegexWord.

R
-

Steven Kelly

Re: [vwnc] Fwd: Regex11

In reply to this post by Steffen Märcker

Cesar Rabak wrote:
> Well, thinking about it I _believe_ #isWordConstituent is as ambiguous
> as it gets since we do not use numbers for plain (English or natural
> language) words as well, right?

Good point about _ and [0-9] not being in English words! I blame the
regex designers who named or defined \w poorly :-)

Vassili Bykov wrote:
> #isWordConstituent was my first thought too, but "word" may suggest
> other interpretations. For example, a locale-specific test for
> characters acceptable in words of the current locale's language.

And #isIdentifierConstituent still tends to imply ASCII. Fair enough,
although it's not true in VW, which accepts non-ASCII characters in
identifiers. #isAlphaNumeric and #isAlphabetic only accept ASCII
characters. Oddly, #isVowel and #isUppercase allow non-ASCII characters.
Maybe we should have #isAlphaNumeric_ :-)

Steve


Wolfgang Eder

Re: [vwnc] Fwd: Regex11

how 'bout #isRegexpWord?
cheers,
wolfgang


Vassili Bykov-2

Re: [vwnc] Fwd: Regex11

In reply to this post by Reinout Heeck

> Hmm, maybe the selector should express which 'standard' we are following
> for this class of characters.
> Something like: #isRegexWordConstituent or #matchesRegexWord.

Yep, I like #isRegexWordConstituent.
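
A minimal sketch of such a predicate, assuming it is added as an extension method on Character (protocol 'testing'). It is hypothetical and not part of the base image:

```smalltalk
isRegexWordConstituent
	"Hypothetical extension on Character, not in the base image.
	 Answer true if the receiver is a letter, a digit or an underscore,
	 i.e. a character that \w is documented to match."

	^self isAlphaNumeric or: [self = $_]
```

With this in place, `$_ isRegexWordConstituent` answers true and `$- isRegexWordConstituent` answers false. For letters and digits the answer is whatever #isAlphaNumeric gives in the image at hand.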

--Vassili

Cesar Rabak

Re: [vwnc] Fwd: Regex11

In reply to this post by Steven Kelly

Steven Kelly wrote:
> Cesar Rabak wrote:
>> Well, thinking about it I _believe_ #isWordConstituent is as ambiguous
>> as it gets since we do not use numbers for plain (English or natural
>> language) words as well, right?
>
> Good point about _ and [0-9] not being in English words! I blame the
> regex designers who named or defined \w poorly :-)
>

Yes, the blame lies with atavism! In programming languages we use some
words as catachreses of real-world concepts, and \w is one of them ;-)

> Vassili Bykov wrote:
>> #isWordConstituent was my first thought too, but "word" may suggest
>> other interpretations. For example, a locale-specific test for
>> characters acceptable in words of the current locale's language.
>
> And #isIdentifierConstituent still tends to imply ASCII. Fair enough,

Yes I think so as well.

> although it's not true in VW, which accepts non-ASCII characters in
> identifiers. #isAlphaNumeric and #isAlphabetic only accept ASCII
> characters. Oddly, #isVowel and #isUppercase allow non-ASCII characters.
> Maybe we should have #isAlphaNumeric_ :-)

#isAlphaNumeric_ would work almost as if it were in a 'private' protocol,
because I don't think we'll expend much energy educating users about
it... but from a (self-)documentation point of view it seems a
nice touch :-)

--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/

Cesar Rabak

Re: [vwnc] Fwd: Regex11

In reply to this post by Steven Kelly

Steven Kelly wrote:
> OK! I've updated the code, tests and documentation and published as 1.3.
>
Steve,

I think we still have an inconsistency in the docs...

The documentation states:

\w any word constituent character (same as [a-zA-Z0-9_])

Given that, Märcker's test cases:

'_' matchesRegex: '\w'
'_' matchesRegex: '[a-zA-Z0-9_]'

both evaluate to true, so that's OK. However:

'_' matchesRegex: '[[:alnum:]]' "evaluates to false"

whereas the documentation says:

<quote>
Character classes can also include the following grep(1)-compatible
elements to refer to:

[:alnum:] any alphanumeric, i.e., a word constituent, character
.
.
.
</quote>

HTH

--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/

Steven Kelly

Re: [vwnc] Fwd: Regex11

In reply to this post by Steffen Märcker

Cesar Rabak [mailto:[hidden email]] wrote:
> [:alnum:] any alphanumeric, i.e., a word constituent, character

Thanks, I missed that one since I was only checking the docs for \w, _, and underscore. I've corrected it and a couple of other mistakes in that list, and published as 1.3.1, also making the issues of locale explicit:

\w          any word constituent character (same as [a-zA-Z0-9_])
\W          any character but a word constituent
\d          a digit (same as [0-9])
\D          anything but a digit
\s          a whitespace character (same as [:space:] below)
\S          anything but a whitespace character

[:alnum:]   any alphanumeric character (same as [a-zA-Z0-9])
[:alpha:]   any alphabetic character (same as [a-zA-Z])
[:cntrl:]   any control character (any character with code < 32)
[:digit:]   any decimal digit (same as [0-9])
[:graph:]   any graphical character (any character with code >= 32)
[:lower:]   any lowercase character (including non-ASCII lowercase characters)
[:print:]   any printable character (in this version, the same as [:graph:])
[:punct:]   any punctuation character: . , ! ? ; : ' - ( ) ` and double quotes
[:space:]   any whitespace character (space, tab, CR, LF, null, form feed, Ctrl-Z, 16r2000-16r200B, 16r3000)
[:upper:]   any uppercase character (including non-ASCII uppercase characters)
[:xdigit:]  any hexadecimal character (same as [a-fA-F0-9])

Note that many of these are only as consistent or inconsistent on issues
of locale as the underlying Smalltalk implementation. Values shown here
are for VisualWorks 7.6.
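
To see how \w and [:alnum:] differ in practice, the expressions below can be evaluated in a workspace. This is only a sketch, assuming Regex11 1.3 or later is loaded. The results for '_' are the ones reported earlier in this thread; the string 'foo_1' is just an illustrative input.

```smalltalk
"Workspace expressions; #matchesRegex: tests the whole receiver string."
'_' matchesRegex: '\w'.                  "true: \w includes the underscore as of 1.3"
'_' matchesRegex: '[[:alnum:]]'.         "false: [:alnum:] covers letters and digits only"
'foo_1' matchesRegex: '\w+'.             "true"
'foo_1' matchesRegex: '[[:alnum:]]+'.    "false: the underscore is not matched"
```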

Steve


Steffen Märcker

Re: [vwnc] Fwd: Regex11

Thanks to all for the interest and the fast changes.

I've got an additional question: if I want to check for foreign characters
like Chinese ideograms, what would an effective and efficient expression
look like? Is it just a matter of matching their Unicode values?

Steffen
