specifying the character class range for some funky characters from İstanbul


Squeak - Dev mailing list
Hi Folks,

My parser rules are not being invoked for certain character classes.



The PEG specifies its own grammar, and in it there are "regex" character classes defined thusly.

Escape <- BACKSLASH [x] [0-9A-F]{6}


which specifies a '\' followed by an 'x' followed by 6 characters in the 0-9 and A-F ranges, e.g. \x000F00
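
For reference, a minimal Python sketch (not Squeak or Xtreams code) of what a six-hex-digit escape for İ could look like. It assumes, which nothing in this thread confirms, that the six digits are the character's Unicode code point; the helper name `hex_escape` is made up for illustration.

```python
def hex_escape(ch: str) -> str:
    """Return a backslash-x escape with the code point as six uppercase hex digits."""
    return "\\x" + format(ord(ch), "06X")


# U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE, the first letter of İstanbul.
print(hex_escape("\u0130"))
```

If that assumption holds, the İ would be written as \x000130 inside a character class.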


I am guessing, but do not know, that I need a character class similar to the above that will handle the funky twirly above the "I" in İstanbul.

I have been using a code-smell rule I call DotNot (i.e. "not the dot") that has clearly outlived its usefulness:

DotNot <- [a-zA-Z0-9_\s\t\-\+\.\;\:\"\&\#\?\%\!\<\>\/\,\=\''\`\(\)\w]  

That rule is used by other rules, to say "accept these characters" and for the links in the linked example that rule looks like:

LinkFreeCaptioned <- OPEN_BRACKET{2} DotNot*  PipeCaption

which does a great job on English, but barfs on İstanbul:

[[İstanbul Büyükşehir Belediyespor (basketball)|Basketball]]

As you can see, the funky "İ" is not in DotNot.
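
To see exactly which characters fall outside the class, here is a rough Python sketch rather than the actual Xtreams matcher. `ALLOWED` is only an approximation of DotNot: it assumes `\s` means a space and leaves out `\w`, whose meaning in this grammar is unclear. The names `ALLOWED` and `rejected` are hypothetical.

```python
import string

# ASCII approximation of DotNot; assumes \s means a space and leaves out \w.
ALLOWED = set(string.ascii_letters + string.digits + "_ \t-+.;:\"&#?%!<>/,='`()")


def rejected(text):
    """List (character, code point) pairs that the ASCII set does not cover."""
    return [(ch, f"U+{ord(ch):04X}") for ch in text if ch not in ALLOWED]


# The link text between "[[" and "|", written with escapes to avoid encoding surprises.
link_text = "\u0130stanbul B\u00fcy\u00fck\u015fehir Belediyespor (basketball)"
for ch, code_point in rejected(link_text):
    print(ch, code_point)
```

Under these assumptions it is not only the İ (U+0130) that is rejected, but also the ü (U+00FC) and the ş (U+015F).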


I have flailed around aimlessly here: https://regexr.com/ to no avail.

Pointers appreciated.

cordially,






Re: specifying the character class range for some funky characters from İstanbul

Tobias Pape


> On 10. Jan 2021, at 14:17, gettimothy via Squeak-dev <[hidden email]> wrote:
>
> Hi Folks,
>
> My parser rules are not being invoked for certain character classes.
>
> For example, look at the İstanbul at this link:  https://en.wikipedia.org/w/index.php?title=Template:%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections&action=edit
>
>
> The PEG specifies its own grammar, and in it there are "regex" character classes defined thusly.
>
> Escape <- BACKSLASH [x] [0-9A-F]{6}
>
>
> which specifies a '\' followed by an 'x' followed by 6 characters in the 0-9 and A-F ranges, e.g. \x000F00
>
>
> I am guessing, but do not know, that I need a character class similar to the above that will handle the funky twirly above the "I" in İstanbul.
>
> I have been using a code-smell rule I call DotNot (i.e. "not the dot") that has clearly outlived its usefulness:
>
> DotNot <- [a-zA-Z0-9_\s\t\-\+\.\;\:\"\&\#\?\%\!\<\>\/\,\=\''\`\(\)\w]  

I think the \w does not do what you think it does here.

What happens is that the upper-case I with dot above (İ) is encoded in that URL as a percent-encoded UTF-8 sequence.
So you need a parser that is (a) aware of URL percent-escaping and (b) aware of Unicode/UTF-8.
That is not what your DotNot does. It can only handle ASCII, I presume…
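
As an illustration of the percent-encoding point, here is a small Python sketch (Python rather than Squeak, purely to show the byte-level picture). The encoded string is the title part of the Wikipedia URL quoted above; `urllib.parse.unquote` decodes the percent-escapes as UTF-8 by default.

```python
from urllib.parse import unquote

# Title fragment taken from the Wikipedia URL quoted above.
encoded = "%C4%B0stanbul_B%C3%BCy%C3%BCk%C5%9Fehir_Belediyesi_sections"
decoded = unquote(encoded)  # undo the percent-escapes, then decode the bytes as UTF-8

print(decoded)
print("\u0130".encode("utf-8"))  # the two UTF-8 bytes behind %C4%B0
print(hex(ord(decoded[0])))      # the single code point those two bytes decode to
```

So the two escaped bytes C4 B0 stand for one character, U+0130, which lies outside any a-z / A-Z range.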

What kind of regex library do you use?


Best regards
        -Tobias





Re: specifying the character class range for some funky characters from İstanbul

Squeak - Dev mailing list
Hi Tobias.


Thanks for the reply.

> I think the \w does not do what you think it does here.
>
> What happens is that the upper-case I with dot above (İ) is encoded in that URL as a percent-encoded UTF-8 sequence.
> So you need a parser that is (a) aware of URL percent-escaping and (b) aware of Unicode/UTF-8.
> That is not what your DotNot does. It can only handle ASCII, I presume…
>
> What kind of regex library do you use?


I have no idea. 

I have basically inferred the functionality of the grammar as I go, with valuable insight from Levente.

There are a couple of PEG grammar rules in Xtreams-Parsing that use character classes, for example:


whitespace <- [\s\t\n\r]

Identifier <- [a-zA-Z_] [a-zA-Z0-9_]*

NumLiteral <- "Infinity" / "0" / [1-9] [0-9]*

Escape <- BACKSLASH [x] [0-9A-F]{6} / BACKSLASH [nrts\-\\\[\]\''\"] / EscapeError

So it is whatever Xtreams or Squeak uses for character classes; beyond that, I have no idea.






Re: specifying the character class range for some funky characters from İstanbul

Squeak - Dev mailing list
In reply to this post by Tobias Pape
Poking around in the grammar some more, I see that it defines Range in terms of RangeSet with this rule:

Range <- OPEN_BRACKET s "^"? RangeSet{1,CLOSE_BRACKET}

That is tied to a callback via a pragma:


Range: excluding sets: sets
    <action: 'Range' arguments: #( 3 4 )>

    sets isEmpty ifTrue: [
        ^excluding
            ifNil: [self DOT]
            ifNotNil: [ [parser not: [parser anything]] ]].
    ^excluding
        ifNil: [ [parser including: sets] ]
        ifNotNil: [ [parser excluding: sets] ]


That is too deep in the weeds for me to grok at the moment, but it appears to be a filter.

excluding: is part of PEGParser:

excluding: intervals
    | position integer |
    position := stream position.
    [stream read: 1 into: cache at: 1] on: Incomplete do: [stream position: position. ^Failure].
    integer := (cache at: 1) asInteger.
    intervals do: [:interval | (interval includes: integer) ifTrue: [stream position: position. ^Failure]].
    ^cache at: 1
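
As far as I can tell, the method reads one element, converts it to an integer, and fails if any interval contains that integer. A rough Python sketch of the same idea follows; it is not the Xtreams implementation. The function names mirror the selectors above, but the interval lists are hypothetical, and whether the Xtreams stream hands over characters or raw bytes at this point is not established in this thread.

```python
# Intervals of code points: 0-9, A-Z, a-z.
ASCII_INTERVALS = [range(0x30, 0x3A), range(0x41, 0x5B), range(0x61, 0x7B)]
# Hypothetical extra interval U+00C0..U+017F, which covers İ, ü and ş
# (it also takes in the × and ÷ signs, so it is not letters only).
LATIN_INTERVALS = ASCII_INTERVALS + [range(0xC0, 0x180)]


def including(ch, intervals):
    """Accept ch if its code point falls inside any interval, else return None."""
    code_point = ord(ch)
    return ch if any(code_point in interval for interval in intervals) else None


def excluding(ch, intervals):
    """Mirror of including: accept ch only if no interval contains its code point."""
    code_point = ord(ch)
    return None if any(code_point in interval for interval in intervals) else ch


print(including("\u0130", ASCII_INTERVALS))  # İ is rejected by the ASCII intervals
print(including("\u0130", LATIN_INTERVALS))  # İ is accepted once the wider interval is added
```

If the class really is matched on integer code points like this, then widening the intervals, rather than listing individual letters, would be one way to let İstanbul through.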



cordially,

t