Unicode Character “à” (U+00E0) and XTreams-Parsing and just ignore the combining mark sequences?

Squeak - Dev mailing list

Hi folks,

TL;DR: in XTreams-Parsing, do I need to add support for the "combining mark" described in this regex tutorial: https://www.regular-expressions.info/unicode.html ?

The Unicode code point U+0300 (grave accent) is a "combining mark".
Any code point that is not a combining mark can be followed by any number of combining marks.
This sequence, like U+0061 U+0300 above, is displayed as a single grapheme on the screen.

Per https://www.compart.com/en/unicode/U+00E0, “à” (U+00E0) can be encoded several ways:

UTF-8 Encoding:	`0xC3 0xA0`
UTF-16 Encoding:	`0x00E0`
UTF-32 Encoding:	`0x000000E0`

I assume that the sequence 0xC3 0xA0 is the combination the regex dude refers to.
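
A caveat on that assumption, as far as I can tell: 0xC3 0xA0 are the two UTF-8 bytes of the single code point U+00E0, whereas the tutorial's combining sequence is the two code points U+0061 U+0300. A minimal workspace sketch of the difference follows; the variable names are mine, and it assumes String>>squeakToUtf8 is present in the image.

```smalltalk
"Workspace sketch: precomposed à versus base letter plus combining mark."
| precomposed decomposed |
precomposed := (Character value: 16r00E0) asString.
decomposed := (Character value: 16r0061) asString , (Character value: 16r0300) asString.

precomposed size.           "1: a single code point"
decomposed size.            "2: base letter followed by the combining grave accent"
precomposed = decomposed.   "false: the strings differ in length"

"Assumes String>>squeakToUtf8 exists in this image.
 The UTF-8 encoding of U+00E0 is the byte pair 16rC3 16rA0."
precomposed squeakToUtf8 asByteArray.
```

If that holds, the bytes 0xC3 0xA0 only matter when decoding a UTF-8 stream, not at the level of Character or the grammar.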

Here are some relevant print-it expressions (the values render correctly in Squeak with Unifont installed, not so much in the browser I am pasting them into):

Character codePoint: 224
Character value: 16r0000E0

Character value: 16r0061
Character value: 16r0300
(Character value: 16r0061) asString , (Character value: 16r0300) asString

Character value: 16r00C3
Character value: 16r00A0
(Character value: 16r00C3) asString , (Character value: 16r00A0) asString

The reason I ask is that, just as Character does not (nor should it?) support combining marks, neither does XTreams-Parsing. From the PEG grammar and the relevant callback, I have the following rules:

Escape <- BACKSLASH [x] [0-9A-F]{6} / BACKSLASH [nrts\-\\\[\]\''\"] / EscapeError
EscapeError <- BACKSLASH .

with callback:

Escape: backslash character: character hexes: hexes
	<action: 'Escape' arguments: #( 1 2 3 )>

	"Answer the Character denoted by an escape. \x is followed by six hex digits naming one code point."
	backslash = '\' ifTrue:
		[character = $s ifTrue: [^Character space].
		character = $t ifTrue: [^Character tab].
		character = $n ifTrue: [^Character cr].
		character = $r ifTrue: [^Character lf].
		character = $x ifTrue: [^('16r', (String withAll: hexes)) asNumber asCharacter]].
	^character

As you can see, this does not capture the pair 0x00C3 0x00A0 and return “à”.

I am strongly leaning towards ignoring the pairs and assuming that all characters such as the above are covered by the extension.
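
To make concrete what ignoring the pairs would mean: each \x escape yields exactly one Character, so a decomposed “à” would simply arrive as two consecutive escapes and two Characters. A workspace sketch, where toCharacter is a hypothetical block that mirrors the $x branch of the callback:

```smalltalk
| toCharacter |
"Hypothetical helper mirroring the $x branch: six hex digits in, one Character out."
toCharacter := [:hexes | ('16r', (String withAll: hexes)) asNumber asCharacter].

(toCharacter value: '0000E0') asInteger.    "224, the precomposed à"
(toCharacter value: '000300') asInteger.    "768, the combining grave accent on its own"

"A decomposed à is two escapes, hence a two-element string."
((toCharacter value: '000061') asString , (toCharacter value: '000300') asString) size.    "2"
```

Under that reading the parser never needs to know about graphemes; whether the two code points are later composed is left to whoever consumes the result.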

Thoughts appreciated.

thx