The Unicode code point U+0300 (combining grave accent) is a "combining mark".
Any code point that is not a combining mark can be followed by any number of combining marks.
This sequence, like U+0061 U+0300 above, is displayed as a single grapheme on the screen.
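To make the distinction concrete, here is a minimal sketch in Python (not Squeak; I am only using its standard `unicodedata` module as a cross-check, and the variable names are my own). It shows that U+0300 has a non-zero combining class, and that the two-code-point sequence U+0061 U+0300 and the single precomposed code point U+00E0 are different strings that only compare equal after normalization.

```python
import unicodedata

base = "\u0061"            # LATIN SMALL LETTER A
mark = "\u0300"            # COMBINING GRAVE ACCENT
decomposed = base + mark   # two code points, displayed as one grapheme
precomposed = "\u00e0"     # LATIN SMALL LETTER A WITH GRAVE, one code point

# A non-zero canonical combining class identifies a combining mark.
mark_is_combining = unicodedata.combining(mark) != 0
base_is_combining = unicodedata.combining(base) != 0

# The two spellings differ as code point sequences...
same_raw = decomposed == precomposed

# ...but NFC composes the pair and NFD splits the single code point.
composed = unicodedata.normalize("NFC", decomposed)
split = unicodedata.normalize("NFD", precomposed)

print(mark_is_combining, base_is_combining, same_raw)
print(len(decomposed), len(precomposed))
print(composed == precomposed, split == decomposed)
```

So "one grapheme on the screen" can be backed by either one or two code points, depending on which form the text is in.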
UTF-8 Encoding: 0xC3 0xA0
UTF-16 Encoding: 0x00E0
UTF-32 Encoding: 0x000000E0
I assume that the sequence 0xC3 0xA0 is the combination the regex dude refers to.
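As a sanity check on the encodings listed above, here is a small Python sketch (again not Squeak; `hex_bytes` is a helper name I made up for illustration). If I read the output right, 0xC3 0xA0 is simply the UTF-8 byte form of the single code point U+00E0, whereas the combining sequence U+0061 U+0300 encodes to three different bytes, so my assumption above may be mixing up two separate things.

```python
precomposed = "\u00e0"        # à as one code point
decomposed = "\u0061\u0300"   # a followed by the combining grave accent

def hex_bytes(text, encoding):
    """Return the encoded bytes as a space-separated hex string."""
    return " ".join(f"0x{b:02X}" for b in text.encode(encoding))

# Big-endian variants are used so that no byte order mark is emitted.
utf8_precomposed = hex_bytes(precomposed, "utf-8")
utf16_precomposed = hex_bytes(precomposed, "utf-16-be")
utf32_precomposed = hex_bytes(precomposed, "utf-32-be")
utf8_decomposed = hex_bytes(decomposed, "utf-8")

print(utf8_precomposed)    # 0xC3 0xA0
print(utf16_precomposed)   # 0x00 0xE0
print(utf32_precomposed)   # 0x00 0x00 0x00 0xE0
print(utf8_decomposed)     # 0x61 0xCC 0x80
```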
Here are some relevant print-its (the values render correctly in Squeak with Unifont installed, not so much in the browser where I print them):
Character value: 16r0000E0
Character value: 16r0061
Character value: 16r0300
(Character value: 16r0061) asString , (Character value: 16r0300) asString
Character value: 16r00C3
Character value: 16r00A0
(Character value: 16r00C3) asString , (Character value: 16r00A0) asString
The reason I ask is that, just as Character does not (nor should it?) support combining marks, neither does XTreams-Parsing. From the PEG grammar and the relevant callback, I have the following rules:
Escape <- BACKSLASH [x] [0-9A-F]{6} / BACKSLASH [nrts\-\\\[\]\''\"] / EscapeError
EscapeError <- BACKSLASH .
with callback:
Escape: backslash character: character hexes: hexes
    <action: 'Escape' arguments: #( 1 2 3 )>
    backslash = '\' ifTrue:
        [character = $s ifTrue: [^Character space].
        character = $t ifTrue: [^Character tab].
        character = $n ifTrue: [^Character cr].
        character = $r ifTrue: [^Character lf].
        character = $x ifTrue: [^('16r', (String withAll: hexes)) asNumber asCharacter]].
    ^character
which, as you can see, does not support capturing the pair 0x00C3 0x00A0 to return “à”.
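To see what a one-escape-per-character rule does with these values, here is a rough Python analogue of the Escape rule and its callback (not the XTreams code itself; `ESCAPE`, `NAMED` and `unescape` are names I invented, and the EscapeError alternative is left out). Each `\x` escape yields exactly one code point, so any combining would have to happen in a separate normalization step afterwards.

```python
import re
import unicodedata

# Mirrors the two Escape alternatives: backslash + x + six hex digits,
# or backslash + one of the listed characters. Anything else is left as-is.
ESCAPE = re.compile(r"\\x([0-9A-F]{6})|\\([nrts\-\\\[\]'\"])")

# Same mapping as the callback: n -> cr, r -> lf.
NAMED = {"s": " ", "t": "\t", "n": "\r", "r": "\n"}

def unescape(text):
    """Replace each recognised escape with the single character it denotes."""
    def replace(match):
        hexes, character = match.group(1), match.group(2)
        if hexes is not None:
            return chr(int(hexes, 16))
        return NAMED.get(character, character)
    return ESCAPE.sub(replace, text)

pair = unescape(r"\x000061\x000300")        # a + combining grave: two code points
single = unescape(r"\x0000E0")              # precomposed à: one code point
bytes_pair = unescape(r"\x0000C3\x0000A0")  # the UTF-8 byte values read as code points

print(len(pair), len(single), len(bytes_pair))
print(unicodedata.normalize("NFC", pair) == single)
print(bytes_pair == single)
```

In this sketch, 0x00C3 and 0x00A0 escaped separately come back as the two characters U+00C3 and U+00A0 rather than “à”, while U+0061 U+0300 only becomes U+00E0 once NFC normalization is applied to the result.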
I am strongly leaning towards ignoring the pairs and assuming that all characters such as above are part of the extension.
Thoughts appreciated.
thx