Today I wrote to Vassili Bykov because of a little bug in Regex11 which
could be interesting for you too. Any opinions?

------- Forwarded message -------
To: "Vassili Bykov" <[hidden email]>
Cc:
Subject: Regex11
Date: Mon, 15 Sep 2008 09:33:13 +0200

Dear Vassili Bykov,

as far as I can see, you wrote the Regex11 package, and I think I've
found a bug. The documentation states:

\w   any word constituent character (same as [a-zA-Z0-9_])

but the actual implementation fails with underscores:

'_' matchesRegex: '\w'             "evaluates to false"
'_' matchesRegex: '[a-zA-Z0-9_]'   "evaluates to true"

Should the bug be fixed, since we rely on the definition of \w, or is it
better to adjust the documentation so as not to break existing
applications? How can I help?

Greetings,
Steffen

_______________________________________________
vwnc mailing list
[hidden email]
http://lists.cs.uiuc.edu/mailman/listinfo/vwnc
Steffen Märcker wrote:
> [...]
> as far as I can see you've written the Regex11 package and I think I've
> found a bug. The documentation states:
>
> \w   any word constituent character (same as [a-zA-Z0-9_])
>
> but the actual implementation fails with underscores:
>
> '_' matchesRegex: '\w'             "evaluates to false"
> '_' matchesRegex: '[a-zA-Z0-9_]'   "evaluates to true"
>
> [...]

I don't understand your "bug" report, because the documentation states
"...same as [a-zA-Z0-9*_*]" (the stars are for emphasis).

Isn't the documentation stating what the code does? Do you have
expectations based on *other* definitions of \w?

--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/
In reply to this post by Steffen Märcker
Hi,
> Isn't the documentation stating what the code does?

Exactly: the documentation states that '_' should match '\w', but the
code does not. Just evaluate the given code examples:

>> '_' matchesRegex: '\w'             "evaluates to false"
>> '_' matchesRegex: '[a-zA-Z0-9_]'   "evaluates to true"

As far as I understand, both should give true... Or have I missed
something?

Regards,
Steffen
What Steffen states is obviously right.

From Smacc Scanner.html:

\w   Matches any letter, number or underscore, [A-Za-z0-9_].

'_' or the underscore should be matched...

regards
skrish
In reply to this post by Steffen Märcker
Different regex libraries differ in whether \w includes
underscore. The current implementation of Regex11 does not have a bug per se:
the code clearly says "char isAlphaNumeric"; rather, the
documentation is incorrect.

However, the move in regex libraries in general seems to be
towards including underscore:

http://jakarta.apache.org/regexp/changes.html (added in 1.3, 2003)
http://nlp.stanford.edu/nlp/javadoc/gnu-regexp-docs/changes.html (added in 1.1, 2001)
http://savannah.gnu.org/bugs/?19637#discussion (GNU documentation "\w = [[:alnum]]" corrected 2007)

How much do people feel adding underscore to \w in Regex11 would
be a breaking change, and how much a bug fix?

Steve
In my opinion it's a clear bug fix ... and similarity to common
implementations is welcome, of course.

+1 for the fix

On 16.09.2008 at 13:59, Steven Kelly <[hidden email]> wrote:

> [...]
> How much do people feel adding underscore to \w in Regex11 would be a
> breaking change, and how much a bug fix?
>
> Steve
>
> From: [hidden email] [mailto:[hidden email]] On Behalf Of Sudhakar Krishnamachari
> Sent: 16 September 2008 13:48
> To: Steffen Märcker
> Cc: vwnc mailing list; [hidden email]
> Subject: Re: [vwnc] Fwd: Regex11
>
> What Steffen states is obviously right..
>
> From Smacc Scanner.html:
>
> \w
> Matches any letter, number or underscore, [A-Za-z0-9_].
>
> '_' or the underscore should be matched...
>
> regards
> skrish
> [...]
I think \w should include underscore to match what seems to be the
expected behavior out there. Going a step further, perhaps a Character
predicate named something like #isIdentifierConstituent to include $_
in addition to pure alphanumerics would be useful.

Cheers,

--Vassili
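Editorial sketch of the predicate suggested above. It is written as a workspace block rather than as a method on Character, so it can be tried without changing the base image; the variable name isIdentifierConstituent is only an illustration taken from the proposed selector, not existing API. Evaluate the expressions one at a time with "Print it".

```smalltalk
"Hypothetical sketch: a block standing in for the proposed Character predicate.
 It accepts anything #isAlphaNumeric accepts, plus the underscore."
| isIdentifierConstituent |
isIdentifierConstituent := [:char | char isAlphaNumeric or: [char = $_]].

$_ isAlphaNumeric.                    "false, which is why a \w based on isAlphaNumeric rejects underscore"
isIdentifierConstituent value: $_.    "true"
isIdentifierConstituent value: $a.    "true"
isIdentifierConstituent value: $-.    "false"
```

Whether such a test belongs in the base image, and under which selector, is exactly what the rest of the thread discusses.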
In reply to this post by Steffen Märcker
OK! I've updated the code, tests and documentation and published as 1.3.
I second the vote for a Character predicate for this in the base - it
would replace longer code in 7 places in a current standard image. I
think the name should be #isWordConstituent, since "identifier"
generally implies the first character is not numeric.

Steve

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On Behalf Of Vassili Bykov
> Sent: 17 September 2008 00:16
> To: Steffen Märcker
> Cc: vwnc mailing list
> Subject: Re: [vwnc] Fwd: Regex11
>
> I think \w should include underscore to match what seems to be the
> expected behavior out there. Going a step further, perhaps a Character
> predicate named something like #isIdentifierConstituent to include $_
> in addition to pure alphanumerics would be useful.
>
> Cheers,
>
> --Vassili
Steven Kelly wrote:
> OK! I've updated the code, tests and documentation and published as
> 1.3.
>
Great!

> I second the vote for a Character predicate for this in the base - it

Please add one to this vote!

> would replace longer code in 7 places in a current standard image. I
> think the name should be #isWordConstituent, since "identifier"
> generally implies the first character is not numeric.

Well, thinking about it, I _believe_ #isWordConstituent is as ambiguous
as it gets, since we do not use numbers in plain (English or natural
language) words either, right?

Now if we stick to Vassili's suggestion, note that the characters '_',
'[0-9]' and '[A-Za-z]' are all constituents of identifiers! So for
Character there is no error in having the predicate named that way.

my .019999...

--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/
In reply to this post by Steven Kelly
On Wed, Sep 17, 2008 at 4:24 AM, Steven Kelly <[hidden email]> wrote:
> OK! I've updated the code, tests and documentation and published as 1.3.

Thank you!

> I second the vote for a Character predicate for this in the base - it
> would replace longer code in 7 places in a current standard image. I
> think the name should be #isWordConstituent, since "identifier"
> generally implies the first character is not numeric.

#isWordConstituent was my first thought too, but "word" may suggest
other interpretations. For example, a locale-specific test for
characters acceptable in words of the current locale's language.

Cheers,

--Vassili
In reply to this post by Cesar Rabak
Cesar Rabak wrote:
> Steven Kelly wrote:
>> OK! I've updated the code, tests and documentation and published as
>> 1.3.
>>
> Great!
>> I second the vote for a Character predicate for this in the base - it
>
> Please add one to this vote!

+1

> Well, thinking about it I _believe_ #isWordConstituent is as ambiguous
> as it gets since we do not use numbers for plain (English or natural
> language) words as well, right?
>
> Now if we stick to Vassili's suggestion, note that a character '_' or
> '[0-9]' or '[A-Za-z]' are constituents of identifiers!

Hmm, maybe the selector should express which 'standard' we are following
for this class of characters.
Something like: #isRegexWordConstituent or #matchesRegexWord.

R
-
In reply to this post by Steffen Märcker
Cesar Rabak wrote:
> Well, thinking about it I _believe_ #isWordConstituent is as ambiguous
> as it gets since we do not use numbers for plain (English or natural
> language) words as well, right?

Good point about _ and [0-9] not being in English words! I blame the
regex designers who named or defined \w poorly :-)

Vassili Bykov wrote:
> #isWordConstituent was my first thought too, but "word" may suggest
> other interpretations. For example, a locale-specific test for
> characters acceptable in words of the current locale's language.

And #isIdentifierConstituent still tends to imply ASCII. Fair enough,
although it's not true in VW, which accepts non-ASCII characters in
identifiers. #isAlphaNumeric and #isAlphabetic only accept ASCII
characters. Oddly, #isVowel and #isUppercase allow non-ASCII characters.
Maybe we should have #isAlphaNumeric_ :-)

Steve
how 'bout #isRegexpWord?
cheers,
wolfgang

Steven Kelly wrote:
> [...]
> And #isIdentifierConstituent still tends to imply ASCII. Fair enough,
> although it's not true in VW, which accepts non-ASCII characters in
> identifiers. #isAlphaNumeric and #isAlphabetic only accept ASCII
> characters. Oddly, #isVowel and #isUppercase allow non-ASCII characters.
> Maybe we should have #isAlphaNumeric_ :-)
>
> Steve
In reply to this post by Reinout Heeck
> Hmm, maybe the selector should express which 'standard' we are following
> for this class of characters.
> Something like: #isRegexWordConstituent or #matchesRegexWord.

Yep, I like #isRegexWordConstituent.

--Vassili
In reply to this post by Steven Kelly
Steven Kelly wrote:
> Cesar Rabak wrote:
>> Well, thinking about it I _believe_ #isWordConstituent is as ambiguous
>> as it gets since we do not use numbers for plain (English or natural
>> language) words as well, right?
>
> Good point about _ and [0-9] not being in English words! I blame the
> regex designers who named or defined \w poorly :-)
>
Yes, the blame is on atavism! In programming languages we have used some
words as catachreses of real-world concepts. \w is one of them ;-)

> Vassili Bykov wrote:
>> #isWordConstituent was my first thought too, but "word" may suggest
>> other interpretations. For example, a locale-specific test for
>> characters acceptable in words of the current locale's language.
>
> And #isIdentifierConstituent still tends to imply ASCII. Fair enough,

Yes, I think so as well.

> although it's not true in VW, which accepts non-ASCII characters in
> identifiers. #isAlphaNumeric and #isAlphabetic only accept ASCII
> characters. Oddly, #isVowel and #isUppercase allow non-ASCII characters.
> Maybe we should have #isAlphaNumeric_ :-)

#isAlphaNumeric_ would work almost as if it were in a 'private' protocol,
because I don't think we'll expend too much energy educating users
about this... but from a (self-)documentation point of view it seems a
nice touch :-)

--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/
In reply to this post by Steven Kelly
Steven Kelly wrote:
> OK! I've updated the code, tests and documentation and published as 1.3.
>
Steve,

I think we still have an inconsistency with the docs...

The documentation states:

\w   any word constituent character (same as [a-zA-Z0-9_])

For Märcker's test cases:

'_' matchesRegex: '\w'
'_' matchesRegex: '[a-zA-Z0-9_]'

both evaluate to true, so OK.

However:

'_' matchesRegex: '[[:alnum:]]'   "evaluates to false".

Whereas the doc says:

<quote>
Character classes can also include the following grep(1)-compatible
elements to refer to:

[:alnum:]   any alphanumeric, i.e., a word constituent, character
. . .
</quote>

HTH

--
Cesar Rabak
GNU/Linux User 52247.
Get counted: http://counter.li.org/
In reply to this post by Steffen Märcker
Cesar Rabak [mailto:[hidden email]] wrote:
> [:alnum:] any alphanumeric, i.e., a word constituent, character

Thanks, I missed that one since I was only checking the docs for \w, _,
and underscore. I've corrected it and a couple of other mistakes in that
list, and published as 1.3.1, also making the issues of locale explicit:

\w          any word constituent character (same as [a-zA-Z0-9_])
\W          any character but a word constituent
\d          a digit (same as [0-9])
\D          anything but a digit
\s          a whitespace character (same as [:space:] below)
\S          anything but a whitespace character

[:alnum:]   any alphanumeric character (same as [a-zA-Z0-9])
[:alpha:]   any alphabetic character (same as [a-zA-Z])
[:cntrl:]   any control character. (any character with code < 32)
[:digit:]   any decimal digit (same as [0-9])
[:graph:]   any graphical character. (any character with code >= 32).
[:lower:]   any lowercase character (including non-ASCII lowercase characters)
[:print:]   any printable character. In this version, this is the same as [:graph:]
[:punct:]   any punctuation character: . , ! ? ; : ' - ( ) ` and double quotes
[:space:]   any whitespace character (space, tab, CR, LF, null, form feed, Ctrl-Z, 16r2000-16r200B, 16r3000)
[:upper:]   any uppercase character (including non-ASCII uppercase characters)
[:xdigit:]  any hexadecimal character (same as [a-fA-F0-9]).

Note that many of these are only as consistent or inconsistent on issues
of locale as the underlying Smalltalk implementation. Values shown here
are for VisualWorks 7.6.

Steve
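Editorial sketch: the corrected list can be spot-checked from a workspace. The expressions below use only #matchesRegex:, as in the examples earlier in the thread; the first two results are the ones reported in the thread for 1.3, and the remaining comments are simply what the 1.3.1 list implies, not results that were run here. Evaluate each line separately with "Print it" in an image with Regex11 loaded.

```smalltalk
"Spot checks against the documented character classes (assumes Regex11 1.3.1)."
'_' matchesRegex: '\w'.              "true since 1.3, as reported in the thread"
'_' matchesRegex: '[[:alnum:]]'.     "false, as reported in the thread"
'_' matchesRegex: '\W'.              "should be false if \W is the complement of \w"
'7' matchesRegex: '\d'.              "should be true according to the list"
'f' matchesRegex: '[[:xdigit:]]'.    "should be true according to the list"
```

Because several classes depend on the underlying Character predicates, results for non-ASCII characters may differ between Smalltalk dialects and versions.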
Thanks to all for the interest and the fast changes.
I've got an additional question: if I want to check for foreign
characters like Chinese ideograms, what would an effective and efficient
expression look like? Is it just matching their Unicode value?

Steffen

On 17.09.2008 at 23:00, Steven Kelly <[hidden email]> wrote:

> Cesar Rabak [mailto:[hidden email]] wrote:
>> [:alnum:] any alphanumeric, i.e., a word constituent, character
>
> Thanks, I missed that one since I was only checking the docs for \w, _,
> and underscore. I've corrected it and a couple of other mistakes in that
> list, and published as 1.3.1, also making the issues of locale explicit:
>
> [...]
>
> Steve