[vwnc] Fwd: Regex11

Steffen Märcker
Today I wrote to Vassili Bykov about a little bug in Regex11 which
could be interesting for you too. Any opinions?


------- Forwarded message -------
To: "Vassili Bykov" <[hidden email]>
Cc:
Subject: Regex11
Date: Mon, 15 Sep 2008 09:33:13 +0200

Dear Vassili Bykov,

as far as I can see you've written the Regex11 package and I think I've
found a bug. The documentation states:

        \w any word constituent character (same as [a-zA-Z0-9_])

but the actual implementation fails with underscores:

        '_' matchesRegex: '\w' "evaluates to false"
        '_' matchesRegex: '[a-zA-Z0-9_]' "evaluates to true"

Should the bug be fixed, because we rely on the definition of \w, or would
it be better to adjust the documentation so as not to break existing
applications? How can I help?

Greetings,
Steffen
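
A quick way to see exactly where the two definitions diverge is to compare \w against the explicit class over the printable ASCII range. This is only a minimal workspace sketch: it assumes Regex11's String>>asRegex and the matcher's #matches:, and the variable names are purely illustrative.

```smalltalk
"Collect every printable ASCII character on which \w and the
 explicit class [a-zA-Z0-9_] give different answers."
| word explicit candidates |
word := '\w' asRegex.
explicit := '[a-zA-Z0-9_]' asRegex.
candidates := (32 to: 126) collect: [:code | Character value: code].
candidates select: [:char |
    | s |
    s := String with: char.
    (word matches: s) ~= (explicit matches: s)]
```

If documentation and implementation agree, the result should be empty; with the behaviour reported above one would expect it to contain only the underscore.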
_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc

Re: [vwnc] Fwd: Regex11

Cesar Rabak
Steffen,

I don't understand your "bug" report, because the documentation states
"...same as [a-zA-Z0-9*_*]" (the stars are for emphasis).

Isn't the documentation stating what the code does?

Are your expectations based on *other* definitions of \w?



--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/

Re: [vwnc] Fwd: Regex11

Steffen Märcker
Hi,

> Isn't the documentation stating what the code does?

Exactly: the documentation states that '_' should match '\w', but the code does not match it. Just evaluate the code examples given:

>>      '_' matchesRegex: '\w'                  "evaluates to false"
>>      '_' matchesRegex: '[a-zA-Z0-9_]'        "evaluates to true"

As far as I understand, both should evaluate to true. Or have I missed something?

Regards,
Steffen

Re: [vwnc] Fwd: Regex11

skrish
What Steffen states is obviously right.

From Smacc Scanner.html:

\w  Matches any letter, number or underscore, [A-Za-z0-9_].

'_', the underscore, should be matched.

regards
skrish

Re: [vwnc] Fwd: Regex11

Steven Kelly
Different regex libraries differ in whether \w includes underscore. The current implementation of Regex11 does not have a bug per se: the code clearly says “char isAlphaNumeric”; rather, the documentation is incorrect.

However, the move in regex libraries in general seems to be towards including underscore:

http://jakarta.apache.org/regexp/changes.html (added in 1.3, 2003)
http://nlp.stanford.edu/nlp/javadoc/gnu-regexp-docs/changes.html (added in 1.1, 2001)
http://savannah.gnu.org/bugs/?19637#discussion (Gnu documentation “\w = [[:alnum]]” corrected 2007)

How much do people feel adding underscore to \w in Regex11 would be a breaking change, and how much a bug fix?

Steve

Re: [vwnc] Fwd: Regex11

Steffen Märcker
In my opinion it's clearly a bug fix, and consistency with common
implementations is welcome, of course.

+1 for the fix

Re: [vwnc] Fwd: Regex11

Vassili Bykov-2
I think \w should include underscore to match what seems to be the
expected behavior out there. Going a step further, perhaps a Character
predicate named something like #isIdentifierConstituent to include $_
in addition to pure alphanumerics would be useful.

Cheers,

--Vassili
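
As a rough sketch of the predicate suggested here, written as a workspace block rather than a real Character method. The block name is hypothetical, and no such selector is assumed to exist in the base image.

```smalltalk
"Hypothetical stand-in for a Character predicate such as
 #isIdentifierConstituent: alphanumerics plus the underscore."
| isIdentifierConstituent |
isIdentifierConstituent := [:char | char isAlphaNumeric or: [char = $_]].

isIdentifierConstituent value: $_.    "true"
isIdentifierConstituent value: $a.    "true"
isIdentifierConstituent value: $-.    "false"
```

As an actual extension method on Character, the body would presumably reduce to ^self isAlphaNumeric or: [self = $_].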

Re: [vwnc] Fwd: Regex11

Steven Kelly
OK! I've updated the code, tests and documentation and published as 1.3.

I second the vote for a Character predicate for this in the base - it would replace longer code in 7 places in a current standard image. I think the name should be #isWordConstituent, since "identifier" generally implies the first character is not numeric.

Steve

Re: [vwnc] Fwd: Regex11

Cesar Rabak
Steven Kelly wrote:
> OK! I've updated the code, tests and documentation and published as
> 1.3.
>
Great!
> I second the vote for a Character predicate for this in the base - it

Please add one to this vote!

> would replace longer code in 7 places in a current standard image. I
> think the name should be #isWordConstituent, since "identifier"
> generally implies the first character is not numeric.

Well, thinking about it, I _believe_ #isWordConstituent is just as
ambiguous, since we do not use numbers in plain (English or other
natural-language) words either, right?

Now, if we stick to Vassili's suggestion, note that the characters '_',
'[0-9]' and '[A-Za-z]' are all constituents of identifiers!

So for Character there is nothing wrong with naming the predicate that way.

my .019999...

--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/

Re: [vwnc] Fwd: Regex11

Vassili Bykov-2
On Wed, Sep 17, 2008 at 4:24 AM, Steven Kelly <[hidden email]> wrote:
> OK! I've updated the code, tests and documentation and published as 1.3.

Thank you!

> I second the vote for a Character predicate for this in the base - it would replace longer code in 7 places in a current standard image. I think the name should be #isWordConstituent, since "identifier" generally implies the first character is not numeric.

#isWordConstituent was my first thought too, but "word" may suggest
other interpretations. For example, a locale-specific test for
characters acceptable in words of the current locale's language.

Cheers,

--Vassili

Re: [vwnc] Fwd: Regex11

Reinout Heeck
Cesar Rabak wrote:
> Steven Kelly wrote:
>> OK! I've updated the code, tests and documentation and published as
>> 1.3.
>>
> Great!
>> I second the vote for a Character predicate for this in the base - it
>
> Please add one to this vote!

+1


> Well, thinking about it I _believe_ #isWordConstituent is as ambiguous
> as it gets since we do not use numbers for plain (English or natural
> language) words as well, right?
>
> Now if we stick to Vassili's suggestion, note that a character '_' or
> '[0-9]' or '[A-Za-z]' are constituents of identifiers!

Hmm, maybe the selector should express which 'standard' we are following
for this class of characters.
Something like: #isRegexWordConstituent or #matchesRegexWord.





R
-

Re: [vwnc] Fwd: Regex11

Steven Kelly
Cesar Rabak wrote:
> Well, thinking about it I _believe_ #isWordConstituent is as ambiguous
> as it gets since we do not use numbers for plain (English or natural
> language) words as well, right?

Good point about _ and [0-9] not being in English words! I blame the
regex designers who named or defined \w poorly :-)

Vassili Bykov wrote:
> #isWordConstituent was my first thought too, but "word" may suggest
> other interpretations. For example, a locale-specific test for
> characters acceptable in words of the current locale's language.

And #isIdentifierConstituent still tends to imply ASCII. Fair enough,
although it's not true in VW, which accepts non-ASCII characters in
identifiers. #isAlphaNumeric and #isAlphabetic only accept ASCII
characters. Oddly, #isVowel and #isUppercase allow non-ASCII characters.
Maybe we should have #isAlphaNumeric_ :-)

Steve

Re: [vwnc] Fwd: Regex11

Wolfgang Eder
how 'bout #isRegexpWord?
cheers,
wolfgang

Re: [vwnc] Fwd: Regex11

Vassili Bykov-2
> Hmm, maybe the selector should express which 'standard' we are following
> for this class of characters.
> Something like: #isRegexWordConstituent or #matchesRegexWord.

Yep, I like #isRegexWordConstituent.

--Vassili

Re: [vwnc] Fwd: Regex11

Cesar Rabak
Steven Kelly wrote:
> Cesar Rabak wrote:
>> Well, thinking about it I _believe_ #isWordConstituent is as ambiguous
>> as it gets since we do not use numbers for plain (English or natural
>> language) words as well, right?
>
> Good point about _ and [0-9] not being in English words! I blame the
> regex designers who named or defined \w poorly :-)
>

Yes, the blame lies with atavism! In programming languages we use some
words as catachreses of real-world concepts. \w is one of them ;-)

> Vassili Bykov wrote:
>> #isWordConstituent was my first thought too, but "word" may suggest
>> other interpretations. For example, a locale-specific test for
>> characters acceptable in words of the current locale's language.
>
> And #isIdentifierConstituent still tends to imply ASCII. Fair enough,

Yes I think so as well.

> although it's not true in VW, which accepts non-ASCII characters in
> identifiers. #isAlphaNumeric and #isAlphabetic only accept ASCII
> characters. Oddly, #isVowel and #isUppercase allow non-ASCII characters.
> Maybe we should have #isAlphaNumeric_ :-)

#isAlphaNumeric_ would work almost as if it were in a 'private' protocol,
because I don't think we'll expend much energy on educating users about
it... but from a (self-)documentation point of view it seems a nice
touch :-)

--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/

Re: [vwnc] Fwd: Regex11

Cesar Rabak
Steven Kelly wrote:
> OK! I've updated the code, tests and documentation and published as 1.3.
>
Steve,

I think we still have an inconsistency with the docs...

The documentation states:

        \w any word constituent character (same as [a-zA-Z0-9_])

Märcker's test cases for this:

        '_' matchesRegex: '\w'
        '_' matchesRegex: '[a-zA-Z0-9_]'

Both now evaluate to true, so that is OK. However:

        '_' matchesRegex: '[[:alnum:]]' "evaluates to false"

Whereas the documentation says:

<quote>
Character classes can also include the following grep(1)-compatible
elements to refer to:

        [:alnum:] any alphanumeric, i.e., a word constituent, character
.
.
.
</quote>


HTH

--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/

Re: [vwnc] Fwd: Regex11

Steven Kelly
Cesar Rabak wrote:
> [:alnum:] any alphanumeric, i.e., a word constituent, character

Thanks, I missed that one since I was only checking the docs for \w, _, and underscore. I've corrected it and a couple of other mistakes in that list, and published as 1.3.1, also making the issues of locale explicit:

\w          any word constituent character (same as [a-zA-Z0-9_])
\W          any character but a word constituent
\d          a digit (same as [0-9])
\D          anything but a digit
\s          a whitespace character (same as [:space:] below)
\S          anything but a whitespace character

[:alnum:]   any alphanumeric character (same as [a-zA-Z0-9])
[:alpha:]   any alphabetic character (same as [a-zA-Z])
[:cntrl:]   any control character (any character with code < 32)
[:digit:]   any decimal digit (same as [0-9])
[:graph:]   any graphical character (any character with code >= 32)
[:lower:]   any lowercase character (including non-ASCII lowercase characters)
[:print:]   any printable character; in this version, the same as [:graph:]
[:punct:]   any punctuation character: . , ! ? ; : ' - ( ) ` and double quotes
[:space:]   any whitespace character (space, tab, CR, LF, null, form feed, Ctrl-Z, 16r2000-16r200B, 16r3000)
[:upper:]   any uppercase character (including non-ASCII uppercase characters)
[:xdigit:]  any hexadecimal character (same as [a-fA-F0-9])

Note that many of these are only as consistent or inconsistent on issues
of locale as the underlying Smalltalk implementation. Values shown here
are for VisualWorks 7.6.

Steve
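
The distinction between \w and [:alnum:] in the list above can be checked directly in a workspace. A minimal sketch, assuming Regex11 1.3.1 or later is loaded; the values in the comments follow the documented definitions above and the results reported earlier in this thread, not a fresh run.

```smalltalk
"\w includes the underscore, [:alnum:] does not."
'_' matchesRegex: '\w'.                 "documented: true"
'_' matchesRegex: '[[:alnum:]]'.        "documented: false"

"Two of the other classes from the list."
'9' matchesRegex: '\d'.                 "documented: true"
'F' matchesRegex: '[[:xdigit:]]'.       "documented: true"
```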

Re: [vwnc] Fwd: Regex11

Steffen Märcker
Thanks to all for the interest and the fast changes.

I have an additional question: if I want to check for foreign characters
such as Chinese ideograms, what would an effective and efficient expression
look like? Is it just a matter of matching their Unicode values?

Steffen
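
On matching by Unicode value: one option that sidesteps regex syntax altogether is a plain code-point range test. This is a minimal sketch, not an answer from the package authors. It assumes Character values in the image are Unicode code points, and it uses only the CJK Unified Ideographs block (16r4E00 to 16r9FFF), which covers the most common Chinese ideographs but not the extension blocks. The block name is illustrative.

```smalltalk
"Hypothetical helper: true for characters in the CJK Unified
 Ideographs block, U+4E00 through U+9FFF."
| isCjkIdeograph |
isCjkIdeograph := [:char | char asInteger between: 16r4E00 and: 16r9FFF].

isCjkIdeograph value: (Character value: 16r4E2D).   "true"
isCjkIdeograph value: $a.                           "false"
```

Whether Regex11 accepts such characters as the end points of a bracket range is not established in this thread, so that would need to be tried separately.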
