Unicode Character “à” (U+00E0) and XTreams-Parsing and just ignore the combining mark sequences?


Squeak - Dev mailing list
Hi folks,

TL;DR: in XTreams-Parsing, do I need to add support for the "combining marks" described in this regex tutorial? https://www.regular-expressions.info/unicode.html

The Unicode code point U+0300 (grave accent) is a "combining mark"
Any code point that is not a combining mark can be followed by any number of combining marks.
This sequence, like U+0061 U+0300 above, is displayed as a single grapheme on the screen.
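
To make the quoted description concrete, here is a small sketch in Python (chosen only because its standard `unicodedata` module makes the check easy; it is not Squeak code, and the helper names `is_combining_mark` and `compose` are made up for illustration). It shows that U+0061 followed by U+0300 is two code points, that U+0300 falls in a Unicode mark category, and that NFC normalization folds the pair into the single code point U+00E0.

```python
import unicodedata

decomposed = "\u0061\u0300"   # 'a' followed by COMBINING GRAVE ACCENT
precomposed = "\u00E0"        # LATIN SMALL LETTER A WITH GRAVE

def is_combining_mark(ch):
    """True when the code point is in a Unicode mark category (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def compose(text):
    """Fold base + combining mark sequences into precomposed code points (NFC)."""
    return unicodedata.normalize("NFC", text)

print(len(decomposed), len(precomposed))
print(is_combining_mark("\u0300"), is_combining_mark("a"))
print(compose(decomposed) == precomposed)
```

If a parser wanted to treat both spellings alike, normalizing the input once up front would be one option, rather than teaching every grammar rule about combining marks.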

Per https://www.compart.com/en/unicode/U+00E0, “à” can be encoded several ways:

UTF-8 Encoding:
0xC3 0xA0
UTF-16 Encoding:
0x00E0
UTF-32 Encoding:
0x000000E0


I assume that the sequence 0xC3 0xA0  is the combination the regex dude refers to.
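
One way to test that assumption, sketched in Python rather than Squeak (the helper name `hex_bytes` is made up for illustration): encode both the precomposed “à” and the combining sequence U+0061 U+0300, and decode the bytes 0xC3 0xA0 back. As far as I can tell, 0xC3 0xA0 is simply the UTF-8 byte form of the single code point U+00E0, while the combining sequence encodes to different bytes.

```python
precomposed = "\u00E0"        # single code point
decomposed = "\u0061\u0300"   # base letter + combining grave accent

def hex_bytes(text, encoding):
    """Encode text and show the resulting bytes as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

utf8_precomposed = hex_bytes(precomposed, "utf-8")
utf8_decomposed = hex_bytes(decomposed, "utf-8")
utf16_precomposed = hex_bytes(precomposed, "utf-16-be")
utf32_precomposed = hex_bytes(precomposed, "utf-32-be")

# Decoding the two UTF-8 bytes gives back one code point, not a pair.
decoded = bytes([0xC3, 0xA0]).decode("utf-8")

print(utf8_precomposed, "|", utf8_decomposed)
print(utf16_precomposed, "|", utf32_precomposed)
print(len(decoded), hex(ord(decoded)))
```

Read this way, the three encodings listed above are byte-level views of the same single code point, which is a separate question from whether the text uses combining marks.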

Here are some relevant printIt expressions (the values render correctly in Squeak with Unifont installed, not so much in the browser where I print them):
Character codePoint:224
Character value: 16r0000E0

Character value: 16r0061
Character value: 16r0300
(Character value: 16r0061) asString , (Character value: 16r0300) asString

Character value: 16r00C3
Character value: 16r00A0
(Character value: 16r00C3) asString , (Character value: 16r00A0) asString

The reason I ask is that, just as Character does not (nor should it?) support combining marks, neither does XTreams-Parsing. In the PEG grammar and the relevant callback I have the following rules:

Escape <- BACKSLASH [x] [0-9A-F]{6}  / BACKSLASH [nrts\-\\\[\]\''\"] / EscapeError
EscapeError <- BACKSLASH .

with callback:
Escape: backslash character: character hexes: hexes
    <action: 'Escape' arguments: #( 1 2 3 )>

    backslash = '\' ifTrue:
        [character = $s ifTrue: [^Character space].
        character = $t ifTrue: [^Character tab].
        character = $n ifTrue: [^Character cr].
        character = $r ifTrue: [^Character lf].
        "\x followed by six hex digits answers exactly one Character"
        character = $x ifTrue: [^('16r', (String withAll: hexes)) asNumber asCharacter]].
    "any other escaped character stands for itself"
    ^character

As you can see, it does not support capturing the pair 0x00C3 0x00A0 and returning “à”.
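
For comparison, here is a rough Python sketch of what the Escape rule and callback do (not Squeak or Xtreams code; the names `unescape`, `SIMPLE_ESCAPES` and `ESCAPE` are made up, and a regex stands in for the PEG rule). It assumes each `\x` escape carries exactly six uppercase hex digits and yields exactly one code point, so `\x0000E0` gives “à” directly, while a combining sequence would have to be written as two escapes and comes back as two characters.

```python
import re

# Mirrors the callback above: \n answers cr and \r answers lf.
SIMPLE_ESCAPES = {"s": " ", "t": "\t", "n": "\r", "r": "\n"}

# Either \x plus six hex digits, or a backslash plus any single character.
ESCAPE = re.compile(r"\\(?:x([0-9A-F]{6})|(.))", re.DOTALL)

def unescape(text):
    """Replace each escape with the single character it denotes."""
    def replace(match):
        hexes, character = match.group(1), match.group(2)
        if hexes is not None:
            # One escape, one code point (chr raises above 0x10FFFF).
            return chr(int(hexes, 16))
        # Unknown escapes fall through to the escaped character itself.
        return SIMPLE_ESCAPES.get(character, character)
    return ESCAPE.sub(replace, text)

print(unescape(r"\x0000E0") == "\u00E0")
print(len(unescape(r"\x000061\x000300")))
```

Under these assumptions the escape handler never needs to know about combining marks: it only ever produces one code point per escape, and any composition would be a separate normalization step.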

I am strongly leaning towards ignoring the pairs and assuming that all characters such as above are part of the extension.

Thoughts appreciated.

thx