Hi Folks,

My parser rules are not being invoked for certain character classes.

For example, look at the İstanbul at this link: https://en.wikipedia.org/w/index.php?title=Template:%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections&action=edit

The PEG specifies its own grammar, and in that there are "regex" character classes defined thusly:

Escape <- BACKSLASH [x] [0-9A-F]{6}

which specifies a '\' followed by an 'x' followed by 6 characters in the 0-9 and A-F ranges, i.e. \x000FOO

I am guessing, but do not know, that I need a character class similar to the above that will handle the funky twirly above the "I" in İstanbul.

I have been using a code smell rule I call DotNot (i.e. "not the dot") that has clearly outlived its usefulness:

DotNot <- [a-zA-Z0-9_\s\t\-\+\.\;\:\"\&\#\?\%\!\<\>\/\,\=\''\`\(\)\w]

That rule is used by other rules to say "accept these characters", and for the links in the linked example that rule looks like:

LinkFreeCaptioned <- OPEN_BRACKET{2} DotNot* PipeCaption

which does a great job on English, but barfs on İstanbul:

[[İstanbul Büyükşehir Belediyespor (basketball)|Basketball]]

As you can see, the funky "I" is not in DotNot.

I have flailed around aimlessly here: https://regexr.com/ to no avail.

Pointers appreciated.

cordially,
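A minimal sketch of the failure described above, written in Python rather than in the PEG library used in this thread (whose character-class semantics may well differ). The names DOTNOT_ASCII, DOTNOT_UNICODE, and accepts are hypothetical, and the classes are simplified stand-ins for DotNot. The last line assumes the six hex digits in Escape denote a Unicode code point, which the grammar excerpt does not confirm.

```python
import re

# Simplified ASCII-only stand-in for DotNot (assumption: the real PEG
# library may interpret its character classes differently).
DOTNOT_ASCII = re.compile(r"[a-zA-Z0-9_\s\-+.;:\"&#?%!<>/,='`()]*")

# Hypothetical Unicode-aware variant: for Python 3 str patterns, \w matches
# any Unicode word character, which includes the dotted capital I (U+0130).
DOTNOT_UNICODE = re.compile(r"[\w\s\-+.;:\"&#?%!<>/,='`()]*")

# "İstanbul Büyükşehir Belediyespor (basketball)", spelled with escapes so
# the precomposed code points are unambiguous.
text = "\u0130stanbul B\u00fcy\u00fck\u015fehir Belediyespor (basketball)"


def accepts(pattern, s):
    """Return True if the whole string consists of characters in the class."""
    return pattern.fullmatch(s) is not None


print(accepts(DOTNOT_ASCII, "Istanbul (basketball)"))  # → True
print(accepts(DOTNOT_ASCII, text))                     # → False
print(accepts(DOTNOT_UNICODE, text))                   # → True

# If the Escape rule's six hex digits are a code point (an assumption),
# the dotted capital I would be written like this:
escape = "\\x%06X" % ord("\u0130")
print(escape)                                          # → \x000130
```

The ASCII-only class rejects the first character of the link text, so any rule built on DotNot* stops matching there.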
> On 10. Jan 2021, at 14:17, gettimothy via Squeak-dev <[hidden email]> wrote:
>
> Hi Folks,
>
> My parser rules are not being invoked for certain character classes.
>
> For example, look at the İstanbul at this link: https://en.wikipedia.org/w/index.php?title=Template:%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections&action=edit
>
> The PEG specifies its own grammar, and in that there are "regex" character classes defined thusly:
>
> Escape <- BACKSLASH [x] [0-9A-F]{6}
>
> which specifies a '\' followed by an 'x' followed by 6 characters in the 0-9 and A-F ranges, i.e. \x000FOO
>
> I am guessing, but do not know, that I need a character class similar to the above that will handle the funky twirly above the "I" in İstanbul.
>
> I have been using a code smell rule I call DotNot (i.e. "not the dot") that has clearly outlived its usefulness:
>
> DotNot <- [a-zA-Z0-9_\s\t\-\+\.\;\:\"\&\#\?\%\!\<\>\/\,\=\''\`\(\)\w]

I think the \w does not do here what you think.

What happens is that the upper-case I with dot above is encoded as a percent-encoded UTF-8 sequence. So you need a parser that is (a) aware of URL percent-escaping and (b) Unicode/UTF-8 aware.

That is not what your DotNot does. It can only handle ASCII, I presume…

What kind of regex lib do you use?

Best regards
-Tobias

> That rule is used by other rules to say "accept these characters", and for the links in the linked example that rule looks like:
>
> LinkFreeCaptioned <- OPEN_BRACKET{2} DotNot* PipeCaption
>
> which does a great job on English, but barfs on İstanbul:
>
> [[İstanbul Büyükşehir Belediyespor (basketball)|Basketball]]
>
> As you can see, the funky "I" is not in DotNot.
>
> I have flailed around aimlessly here: https://regexr.com/ to no avail.
>
> Pointers appreciated.
>
> cordially,
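To make the percent-encoding point concrete, here is a small sketch using Python's standard urllib.parse module (an illustration only, not part of the Squeak PEG toolchain). It decodes the title parameter from the URL above and shows that %C4%B0 is simply the two UTF-8 bytes of the dotted capital I, U+0130.

```python
from urllib.parse import quote, unquote

# The title parameter as it appears in the Wikipedia URL quoted above.
title = "%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections"

# unquote() reverses percent-escaping and decodes the bytes as UTF-8
# (UTF-8 is its default encoding).
decoded = unquote(title)
print(decoded)

# The dotted capital I is U+0130; in UTF-8 it is the byte pair C4 B0,
# which is exactly what the URL spells out as %C4%B0.
dotted_i = "\u0130"
print(dotted_i.encode("utf-8"))  # → b'\xc4\xb0'
print(quote(dotted_i))           # → %C4%B0
```

Once the input is decoded to Unicode text this way, the remaining question is whether the grammar's character classes accept code points outside ASCII.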
Hi Tobias. Thanks for the reply.
In reply to this post by Tobias Pape
Poking around in the grammar some more, I see that it defines RangeSet with a rule
That is tied to a callback via a pragma...
This is too deep in the weeds to grok at the moment, but it appears to be an exclusion filter that is part of PEGParser.
cordially, t