Specifying the character class range for some funky characters from İstanbul


Squeak - Dev mailing list

Hi Folks,

My parser rules are not being invoked for certain character classes.

For example, look at the İstanbul at this link: https://en.wikipedia.org/w/index.php?title=Template:%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections&action=edit

The PEG specifies its own grammar, and in it there are "regex"-style character classes, defined thus:

Escape <- BACKSLASH [x] [0-9A-F]{6}

which specifies a '\' followed by an 'x' followed by six characters in the 0-9 and A-F ranges, e.g. \x000F00
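
To make the Escape rule concrete, here is a minimal sketch in Python (used purely for illustration; it is not how Xtreams implements the rule). It assumes the six hex digits stand for a Unicode code point, which the thread does not confirm, and the helper name `to_peg_escape` is made up for this example.

```python
import re

# Mirror of the grammar rule: BACKSLASH [x] [0-9A-F]{6}
ESCAPE = re.compile(r"\\x[0-9A-F]{6}")

def to_peg_escape(ch: str) -> str:
    """Hypothetical helper: write one character as backslash, x, six hex digits.

    Assumption: the six digits are the character's Unicode code point.
    """
    return "\\x{:06X}".format(ord(ch))

dotted_i = "\u0130"                      # LATIN CAPITAL LETTER I WITH DOT ABOVE
escape = to_peg_escape(dotted_i)

print(escape)                                   # \x000130
print(bool(ESCAPE.fullmatch(escape)))           # True
print(bool(ESCAPE.fullmatch("\\x000F00")))      # True
print(bool(ESCAPE.fullmatch("\\x00013")))       # False: only five hex digits
```

If that assumption holds, a range in a character class could be spelled with two such escapes instead of the literal characters.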

I am guessing, but do not know, that I need a character class similar to the above that will handle the funky twirly above the "I" in İstanbul

I have been using a code smell rule I call DotNot (i.e. "not the dot") that has clearly outlived its usefulness....

DotNot <- [a-zA-Z0-9_\s\t\-\+\.\;\:\"\&\#\?\%\!\<\>\/\,\=\''\`\(\)\w]

That rule is used by other rules, to say "accept these characters" and for the links in the linked example that rule looks like:

LinkFreeCaptioned <- OPEN_BRACKET{2} DotNot* PipeCaption

which does a great job on English, but barfs on İstanbul:

[[İstanbul Büyükşehir Belediyespor (basketball)|Basketball]]

As you can see, the funky "I" is not in DotNot.
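
A minimal sketch of the failure, in Python purely for illustration. The class below is an ASCII-only approximation of DotNot (the `\s` and `\w` entries are left out on purpose, because Python's `\w` already matches Unicode letters and would hide the problem). The added range U+00C0..U+024F is an assumption chosen to cover the Turkish letters in this example, not something taken from the grammar.

```python
import re

# ASCII-only approximation of the DotNot character class (no \s, no \w).
BASE = r"a-zA-Z0-9_ \t\-+.;:\"&#?%!<>/,='`()"
ASCII_DOTNOT = "[" + BASE + "]"
# Same class plus an assumed range: Latin-1 letters through Latin Extended-B.
WIDE_DOTNOT = "[" + BASE + "\u00C0-\u024F]"

# "İstanbul Büyükşehir Belediyespor (basketball)" written with escapes
text = "\u0130stanbul B\u00fcy\u00fck\u015fehir Belediyespor (basketball)"

def rejected(char_class: str, s: str) -> list:
    """Return the characters of s that the character class does not accept."""
    one_char = re.compile(char_class)
    return [c for c in s if not one_char.fullmatch(c)]

print(rejected(ASCII_DOTNOT, text))   # ['İ', 'ü', 'ü', 'ş']
print(rejected(WIDE_DOTNOT, text))    # []
```

Note that İ is not the only character rejected: ü and ş fall outside the ASCII ranges as well.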

I have flailed around aimlessly here: https://regexr.com/ to no avail.

Pointers appreciated.

cordially,

Tobias Pape

Re: Specifying the character class range for some funky characters from İstanbul

> On 10. Jan 2021, at 14:17, gettimothy via Squeak-dev <[hidden email]> wrote:
>
> Hi Folks,
>
> My parser rules are not being invoked for certain character classes.
>
> For example, look at the İstanbul at this link: https://en.wikipedia.org/w/index.php?title=Template:%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections&action=edit
>
>
> The PEG specifies its own grammar, and in it there are "regex"-style character classes, defined thus:
>
> Escape <- BACKSLASH [x] [0-9A-F]{6}
>
>
> which specifies a '\' followed by an 'x' followed by six characters in the 0-9 and A-F ranges, e.g. \x000F00
>
>
> I am guessing, but do not know, that I need a character class similar to the above that will handle the funky twirly above the "I" in İstanbul
>
> I have been using a code smell rule I call DotNot (i.e. "not the dot") that has clearly outlived its usefulness....
>
> DotNot <- [a-zA-Z0-9_\s\t\-\+\.\;\:\"\&\#\?\%\!\<\>\/\,\=\''\`\(\)\w]

I think the \w here does not do what you think it does.

What happens is that the upper-case I with dot above is encoded as a percent-encoded UTF-8 sequence.
So you need a parser that is (a) aware of URL percent-escaping and (b) aware of Unicode/UTF-8.
That is not what your DotNot does. It can only handle ASCII, I presume…
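
The two layers can be seen with a short Python sketch (Python's standard `urllib.parse` is used only for illustration; the input is the title fragment from the link quoted above):

```python
from urllib.parse import quote, unquote

# Page title as it appears in the URL: UTF-8 bytes, percent-encoded.
title = "%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections"

decoded = unquote(title)          # undo percent-escaping, then decode as UTF-8
print(decoded)                    # İstanbul_Büyükşehir_Belediyesi_sections

first = decoded[0]
print(hex(ord(first)))            # 0x130  (the code point)
print(first.encode("utf-8"))      # b'\xc4\xb0'  (two UTF-8 bytes)
print(quote(first))               # %C4%B0  (those bytes, percent-encoded)
```

So the single character İ is the code point U+0130, two bytes in UTF-8, and six ASCII characters in the URL. Which of these forms a character class sees depends on where in that chain the parser reads its input.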

What kind of Regex-lib do you use?

Best regards
-Tobias


Squeak - Dev mailing list

Re: Specifying the character class range for some funky characters from İstanbul

Hi Tobias.

Thanks for the reply.

> I think the \w here does not do what you think it does.
>
> What happens is that the upper-case I with dot above is encoded as a percent-encoded UTF-8 sequence.
> So you need a parser that is (a) aware of URL percent-escaping and (b) aware of Unicode/UTF-8.
> That is not what your DotNot does. It can only handle ASCII, I presume…
>
> What kind of Regex-lib do you use?

I have no idea.

I have basically inferred the functionality of the Grammar as I go with valuable insight from Levente.

There are a couple of PEG grammar rules in Xtreams-Parsing that use character classes, for example:

whitespace <- [\s\t\n\r]

Identifier <- [a-zA-Z_] [a-zA-Z0-9_]*

NumLiteral <- "Infinity" / "0" / [1-9] [0-9]*

Escape <- BACKSLASH [x] [0-9A-F]{6} / BACKSLASH [nrts\-\\\[\]\''\"] / EscapeError

So, whatever Xtreams or Squeak uses for character classes, I have no idea.

Squeak - Dev mailing list

Re: Specifying the character class range for some funky characters from İstanbul

In reply to this post by Tobias Pape

Poking around in the grammar some more, I see that it defines Range in terms of RangeSet with this rule:

Range <- OPEN_BRACKET s "^"? RangeSet{1,CLOSE_BRACKET}

That is tied to a callback via a pragma...

Range: excluding sets: sets
    <action: 'Range' arguments: #( 3 4 )>

    sets isEmpty ifTrue: [
        ^excluding
            ifNil: [self DOT]
            ifNotNil: [ [parser not: [parser anything]] ]].
    ^excluding
        ifNil: [ [parser including: sets] ]
        ifNotNil: [ [parser excluding: sets] ]

That is too deep in the weeds to grok at the moment, but it appears to be a filter.

excluding: is part of PEGParser:

excluding: intervals
    | position integer |
    position := stream position.
    [stream read: 1 into: cache at: 1] on: Incomplete do: [stream position: position. ^Failure].
    integer := (cache at: 1) asInteger.
    intervals do: [:interval | (interval includes: integer) ifTrue: [stream position: position. ^Failure]].
    ^cache at: 1
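
Read that way, the method turns the next element into an integer and tests it against a list of intervals. Here is a rough Python sketch of the same idea, not a port of Xtreams. `None` stands in for Failure, and `including` is an assumed mirror image of `excluding:` (its source is not shown in this thread). The range U+00C0..U+024F is likewise an assumption, picked only because it contains İ.

```python
def excluding(intervals, text, pos=0):
    """Accept text[pos] unless its code point lies in one of the intervals."""
    if pos >= len(text):              # nothing left to read (the Incomplete case)
        return None
    code = ord(text[pos])             # counterpart of `(cache at: 1) asInteger`
    if any(code in interval for interval in intervals):
        return None                   # Failure: the character is excluded
    return text[pos]

def including(intervals, text, pos=0):
    """Assumed counterpart: accept text[pos] only if some interval contains it."""
    if pos >= len(text):
        return None
    code = ord(text[pos])
    return text[pos] if any(code in interval for interval in intervals) else None

ascii_letters = [range(ord("a"), ord("z") + 1), range(ord("A"), ord("Z") + 1)]
with_latin_ext = ascii_letters + [range(0x00C0, 0x0250)]   # assumed extra range

word = "\u0130stanbul"                        # İstanbul
print(including(ascii_letters, word))         # None: 0x130 is in no ASCII range
print(including(with_latin_ext, word))        # İ
print(excluding(ascii_letters, word))         # İ
```

If the stream hands the parser whole characters rather than UTF-8 bytes, then on this reading a class only has to contain an interval that covers 16r0130 for İ to be accepted.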

cordially,