I'm currently groping my way to seeing how feature-complete our Unicode support is. I am doing this to establish what still needs to be done to provide full Unicode support.

This seems to me to be an area where it would be best to write it once, and then have the same codebase incorporated into the Smalltalks that most share a common ancestry.

I am keen to get: equality-testing for strings; sortability for strings which have ligatures and diacritic characters; and correct round-tripping of data.

Call to action:
==========

If you have comments on these proposals - such as "but we already have that facility" or "the reason we do not have these facilities is because they are dog-slow" - please let me know them.

If you would like to help out, please let me know.

If you have Unicode experience and expertise, and would like to be, or would be willing to be, in the 'council of experts' for this project, please let me know.

If you have comments or ideas on anything else mentioned in this email, please let me know those too.

In the first instance, the initiative's website will be:
http://smalltalk.uk.to/unicode.html

I have created a SqueakSource.com project called UnicodeSupport

I want to avoid re-inventing any facilities which already exist, except where they prevent us reaching the goals of:
- sortable UTF-8 strings
- sortable UTF-16 strings
- equivalence testing of 2 UTF-8 strings
- equivalence testing of 2 UTF-16 strings
- round-tripping UTF-8 strings through Smalltalk
- round-tripping UTF-16 strings through Smalltalk

As I understand it, we have limited Unicode support at the moment.

Current state of play
===============

ByteString gets converted to WideString when the need is automagically detected.

Is there anything else that currently exists?

Definition of Terms
==============

A quick definition of terms before I go any further:

Standard terms from the Unicode standard
===============================

a compatibility character: an additional encoding of a *normal* character, for compatibility and round-trip conversion purposes. For instance, a 1-byte encoding of a Latin character with a diacritic.

Made-up terms
============

a convenience codepoint: a single codepoint which represents an item that is also encoded as a string of codepoints.

(I tend to use the terms compatibility character and compatibility codepoint interchangeably. The standard only refers to them as compatibility characters. However, the standard is determined to emphasise that characters are abstract and that codepoints are concrete, so I think it is often more useful and productive to think of compatibility or convenience codepoints.)

a composed character: a character made up of several codepoints.

Unicode encoding explained
=====================

A convenience codepoint can therefore be thought of as a codepoint used for a character which also has a composed form.

The way Unicode works is that sometimes you can encode a character in one byte, sometimes not. Sometimes you can encode it in two bytes, sometimes not.

You can therefore have a long stream of ASCII which is single-byte Unicode. If there is an occasional Cyrillic or Greek character in the stream, it would be represented either by a compatibility character or by a multi-byte combination.

Using compatibility characters can prevent proper sorting and equivalence testing.

Using "pure" Unicode, i.e. "normal encodings", can cause compatibility and round-tripping problems. Although avoiding them can *also* cause compatibility issues and round-tripping problems.
Currently my thinking is:

a Utf8String class: an OrderedCollection, with 1-byte characters as the modal element, but short arrays of wider characters where necessary.

a Utf16String class: an OrderedCollection, with 2-byte characters as the modal element, but short arrays of wider characters, beginning with a 2-byte endianness indicator.

Utf8Strings sometimes need to be sortable, and sometimes need to be compatible.

So my thinking is that Utf8String will contain convenience codepoints, for round-tripping, and where there are multiple convenience codepoints for a character, that it standardises on one.

And that there is a Utf8SortableString which uses *only* normal characters.

We then need methods to convert between the two:

aUtf8String asUtf8SortableString

and

aUtf8SortableString asUtf8String

Sort orders are culture- and context-dependent - Sweden and Germany have different sort orders for the same characters with diacritics. Some countries have one order in general usage, and another for specific usages, such as phone directories (e.g. UK and France).

Similarly for UTF-16: Utf16String and Utf16SortableString and conversion methods.

A list of sorted words would be a SortedCollection, and there could be pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder, seOrder, ukOrder, etc., along the lines of

aListOfWords := SortedCollection sortBlock: deOrder

(see the sketch at the end of this message).

If a word is either a Utf8SortableString, or a well-formed Utf8String, then we can perform equivalence testing on them trivially.

To make sure a Utf8String is well formed, we would need a way of cleaning up any convenience codepoints which were valid, but which were for a character which has multiple equally-valid alternative convenience codepoints, and for which the string currently had the "wrong" convenience codepoint. (I.e. for any character with valid alternative convenience codepoints, we would choose one to be in the well-formed Utf8String, and we would need a method for cleaning the alternative convenience codepoints out of the string, and replacing them with the chosen, approved convenience codepoint.)

aUtf8String cleanUtf8String

With WideString, a lot of the issues disappear - except round-tripping. (Although I'm sure I have seen something recently about 4-byte strings that also have an additional bit, which would make some Unicode characters 5 bytes long.)

(I'm starting to zone out now - if I've overlooked anything - obvious, subtle, or somewhere in between - please let me know.)

Cheers,
Euan
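P.S. A minimal sketch of what one pre-prepared sortBlock might look like. deOrder here is purely illustrative - it folds characters per the German DIN 5007-1 "dictionary" ordering, and a real implementation would use proper collation tables:

    | deFold deOrder words |
    deFold := [:word | | s |
        "Fold for German dictionary order: case-insensitive, ä→a, ö→o, ü→u, ß→ss."
        s := word asLowercase.
        s := s copyReplaceAll: 'ä' with: 'a'.
        s := s copyReplaceAll: 'ö' with: 'o'.
        s := s copyReplaceAll: 'ü' with: 'u'.
        s := s copyReplaceAll: 'ß' with: 'ss'.
        s].
    deOrder := [:a :b | (deFold value: a) <= (deFold value: b)].
    words := SortedCollection sortBlock: deOrder.
    words addAll: #('Zürich' 'Zyklus' 'Zucker').
    words asArray    "('Zucker' 'Zürich' 'Zyklus')"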
Why would you want to have strings with UTF-8 or UTF-16 encoding in the
image? What's wrong with the current UTF-32 representation?

Levente

On Fri, 4 Dec 2015, EuanM wrote:
> I'm currently groping my way to seeing how feature-complete our
> Unicode support is. [snip]
In reply to this post by EuanM
Euan,
When you encode a Unicode*String into UTF-8, it should be in the form of a ByteArray. In GemStone we have a Utf8 class that is a subclass of ByteArray that stores the encoded bytes. If you use a String to store a UTF-8-encoded string, then you run the risk of getting confused as to whether or not the String is encoded - a separate class eliminates this potential confusion.

My $0.02 ...

Dale
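P.S. Roughly, the shape of the idea - sketched here in Pharo terms rather than GemStone's, with Zinc's utf8Encoded/utf8Decoded helpers; the class itself is illustrative:

    ByteArray subclass: #Utf8Bytes
        instanceVariableNames: ''
        classVariableNames: ''
        category: 'Unicode-Support'.

    "The type says 'these are encoded bytes', so they can't be mistaken for a decoded String:"
    bytes := 'Straße' utf8Encoded.       "a ByteArray: #[83 116 114 97 195 159 101]"
    bytes utf8Decoded = 'Straße'.        "true - round-trips back to the decoded String"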
On 12/04/2015 03:42 AM, EuanM wrote:
> I'm currently groping my way to seeing how feature-complete our
> Unicode support is. [snip]
In reply to this post by Levente Uzonyi
> On 04-12-2015, at 6:46 AM, Levente Uzonyi <[hidden email]> wrote:
>
> Why would you want to have strings with UTF-8 or UTF-16 encoding in the image?
> What's wrong with the current UTF-32 representation?

WideStrings are perfectly OK most of the time, as are plain old byte Strings. Where things get a bit awkward is when interfacing to code that requires UTF-8, such as Cairo/Pango and some OS interfaces.

Currently we can have a simple byte String, edit in or append a wide character, and all works properly; a WideString is made, everything gets sorted out. Well, everything I've had to try out for the Pi Scratch project.

The problem is in having to convert too often; for example, every rendering operation requires a conversion from Squeak format to UTF-8. Some file reading operations require conversion from UTF-8 to Squeak format.

One idea I had, but haven't done anything with yet, is to make a class that keeps both formats around, to effectively cache the UTF-8 (see the sketch below). It isn't needed for anywhere near all Strings. All editing/sorting would work on the Squeak-format part, and after each edit the conversion would be done - or possibly the UTF-8 version flushed, to cause a new conversion if it were ever requested.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful Latin Phrases:- Sentio aliquos togatos contra me conspirare = I think some people in togas are plotting against me.
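P.S. The caching idea as a completely untested sketch - all names invented, and squeakToUtf8 assumed to be the usual conversion:

    Object subclass: #CachedUtf8String
        instanceVariableNames: 'characters utf8Cache'
        classVariableNames: ''
        category: 'Unicode-Support'.

    utf8
        "Answer the cached UTF-8 bytes, converting only when the cache has been flushed."
        ^ utf8Cache ifNil: [utf8Cache := characters squeakToUtf8]

    at: index put: aCharacter
        "Edit the Squeak-format part and flush the now-stale UTF-8."
        utf8Cache := nil.
        ^ characters at: index put: aCharacter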
In reply to this post by EuanM
This may be out of the scope of your project, but there is also the issue that Squeak/Pharo don't display most characters. Copy the following into a workspace and only the first line is rendered properly. At a minimum, there should be font substitution happening when the current font doesn't contain the necessary glyphs.
Welcome καλωσόρισμα добро пожаловать בברכה أهلا بك स्वागत ยินดีต้อนรับ ようこそ 欢迎 환영 ❄ 𝄞

On Fri, Dec 4, 2015 at 3:42 AM, EuanM <[hidden email]> wrote:
> I'm currently groping my way to seeing how feature-complete our
> Unicode support is. [snip]
> On 05-12-2015, at 11:10 AM, Ryan Macnak <[hidden email]> wrote:
>
> This may be out of the scope of your project, but there is also the issue that Squeak/Pharo don't display most characters. At a minimum there should be font substitution happening when the current font doesn't contain the necessary glyphs.

We'll need to properly complete an interface to Cairo/Pango, or perhaps even better, incorporate Nile. In Pi Scratch I have a moderately tacky hack to make all text rendering go via Pango if the relevant plugin is present. It is surprisingly effective, with no noticeable impact on the text rendering performance *within Scratch*. I don't know what it might do to performance for general dev UI tool text rendering; clearly large volumes of text in multi-paragraph chunks are not an issue in Scratch.

tim
--
tim Rowledge; [hidden email]; http://www.rowledge.org/tim
Useful random insult:- IQ = dx / (1 + dx), where x = age.
In reply to this post by Levente Uzonyi
First, what's UTF-32? Second, we have the whole language tag thing that nobody else uses.
Finally, UTF-8 is a great encoding that certain kinds of applications really ought to use. Web apps, in particular, benefit from using UTF-8 so they don't have to decode and then re-encode strings coming in from the network. In DabbleDB we used UTF-8-encoded strings in the image, and just ignored the fact that they were displayed incorrectly by inspectors. Having a proper UTF-8 string class would be useful.

- Colin

> On Dec 4, 2015, at 6:46 AM, Levente Uzonyi <[hidden email]> wrote:
>
> Why would you want to have strings with UTF-8 or UTF-16 encoding in the image?
> What's wrong with the current UTF-32 representation?
>
> Levente
> [snip]
On Sat, 5 Dec 2015, Colin Putney wrote:
> First, what's UTF-32? Second, we have the whole language tag thing that nobody else uses.

In Squeak, Strings use UTF-32 encoding[1]. It's straightforward to see for WideString, but ByteString is just a subset of WideString, so it uses the same encoding. We also use language tags, but that's a different story. Language tags make it possible to work around the problems introduced by the Han unification[2]. We shouldn't really use them for non-CJKV languages.

> Finally, UTF-8 is a great encoding that certain kinds of applications really ought to use. Web apps, in particular, benefit from using UTF-8 so they don't have to decode and then re-encode strings coming in from the network. In DabbleDB we used UTF-8-encoded strings in the image, and just ignored the fact that they were displayed incorrectly by inspectors. Having a proper UTF-8 string class would be useful.

We do the same thing, but that doesn't mean it's a good idea to create a new String-like class having its content encoded in UTF-8, because UTF-8-encoded strings can't be modified like regular strings. While it would be possible to implement all operations, such an implementation would become the next SortedCollection (bad performance due to misuse).

Levente

[1] https://en.wikipedia.org/wiki/UTF-32
[2] https://en.wikipedia.org/wiki/Han_unification
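P.S. The UTF-32 point, concretely - each element of a String is a whole codepoint, even beyond the Basic Multilingual Plane:

    (WideString with: (Character value: 16r1D11E)) first asInteger.
    "119070, i.e. U+1D11E MUSICAL SYMBOL G CLEF, stored as a single element"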
On Sat, Dec 5, 2015 at 8:41 PM, Levente Uzonyi <[hidden email]> wrote:

> We do the same thing, but that doesn't mean it's a good idea to create a new String-like class having its content encoded in UTF-8, because UTF-8-encoded strings can't be modified like regular strings. While it would be possible to implement all operations, such an implementation would become the next SortedCollection (bad performance due to misuse).

Well, UTF-8 strings would have different performance tradeoffs than our existing string classes. Random access would be expensive, in-place modification would sometimes be expensive, memory usage for non-English strings would be lower, and encoding/decoding for IO would be eliminated. I find that's a good fit for some of my uses of strings, and don't mind thinking about the tradeoffs. YMMV.

One idea I've wondered about in the past is having classes instead of language tags: EnglishString, RomanianString etc., with encodings that make sense for the language (a sketch below). That would do a lot for m17n, without going for the full complexity of Unicode. It could also co-exist well with Utf8String, Utf16String etc., since those could be considered pseudo-languages/encodings. The downside would be that multi-lingual strings would be more difficult - you'd need ropes or the like.

Colin
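P.S. Roughly what I'm imagining - purely illustrative, none of these classes or selectors exist:

    String subclass: #RomanianString
        instanceVariableNames: ''
        classVariableNames: ''
        category: 'Unicode-Support'.

    "Each language class would pin down a compact encoding and the right collation:"
    RomanianString class >> encodingName
        ^ 'ISO-8859-16'

    RomanianString class >> sortBlock
        ^ [:a :b | "Romanian collation rules would go here" a <= b]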
In reply to this post by Levente Uzonyi
On 12/6/15, Levente Uzonyi <[hidden email]> wrote:
> On Sat, 5 Dec 2015, Colin Putney wrote:
>
>> Finally, UTF-8 is a great encoding that certain kinds of applications
>> really ought to use. [snip]
>
> We do the same thing, but that doesn't mean it's a good idea to create a
> new String-like class having its content encoded in UTF-8, because
> UTF-8-encoded strings can't be modified like regular strings. While it
> would be possible to implement all operations, such an implementation
> would become the next SortedCollection (bad performance due to misuse).

This is not the case if you go for ropes:

https://github.com/KenDickey/Cuis-Smalltalk-Ropes

> Levente
> [snip]
---------- Forwarded message ----------
From: Todd Blanchard <[hidden email]>
Date: Sun, 06 Dec 2015 08:37:12 -0800
Subject: Re: [Pharo-dev] [squeak-dev] Unicode Support
To: Pharo Development List <[hidden email]>

(Resent because of a bounce notification - email handling in OS X is really beginning to annoy me. Sorry if it's a dup.)

I used to worry a lot about strings being indexable, and then I eventually let go of that and realized that it isn't a particularly important property for them to have.

I think you will find that UTF-8 is generally the most convenient for a lot of things, but it's a bit like light, in that you treat it alternately as a wave or a particle depending on what you are trying to do. So it is with strings: they can be treated alternately as streams or byte arrays (not character arrays - stop thinking in characters; see the sketch below).

In practice, this tends not to be a problem, since a lot of the time when you want to replace a character or pick out the nth one, you are doing something very computerish and the characters you are working with are of the single-byte (ASCII legacy) variety. You generally know when you can get away with that and when you can't. Otherwise you are most likely doing things that are best dealt with in a streaming paradigm.

For most computation, you come to realize you don't generally care how many characters, but how much space (bytes) you need to store your chunk of text.

Collation is tricky and complicated in Unicode in general, but it isn't any worse in UTF-8 than in any other encoding. You are still going to scan each sortable item from front to back to determine its order, regardless.

Most of the outside world has settled on UTF-8, and any ASCII file is already UTF-8 - which is why it ends up being so convenient. Most of our old text handling infrastructure can still handle UTF-8, while it tends to choke on wider encodings.

-Todd Blanchard

> On Dec 6, 2015, at 07:23, H. Hirzel <[hidden email]> wrote:
>
>> We do the same thing, but that doesn't mean it's a good idea to create a
>> new String-like class having its content encoded in UTF-8 [snip]
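P.S. Treating the bytes as a stream rather than an array, sketched with Pharo's Zinc classes (API from memory - check before relying on it):

    | bytes stream |
    bytes := 'naïve' utf8Encoded.                          "#[110 97 195 175 118 101]"
    stream := ZnCharacterReadStream on: bytes readStream.  "decodes UTF-8 lazily"
    [stream atEnd] whileFalse: [Transcript show: stream next asString].
    "walks the text character by character, with no random access needed"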
In reply to this post by Hannes Hirzel
On Sun, 6 Dec 2015, H. Hirzel wrote:
> On 12/6/15, Levente Uzonyi <[hidden email]> wrote: >> On Sat, 5 Dec 2015, Colin Putney wrote: >> >>> First, what's UTF-32? Second, we have the whole language tag thing that >>> nobody else uses. >> >> In Squeak, Strings use UTF-32 encoding[1]. It's straightforward >> to see for WideString, but ByteString is just a subset of WideString, so >> it uses the same encoding. We also use language tags, but that's a >> different story. >> Language tags make it possible to work around the problems introduced by >> the Han unification[2]. We shouldn't really use them for non-CJKV >> languages. >> >>> >>> Finally, UTF-8 is a great encoding that certain kinds of applications >>> really ought to use. Web apps, in particular, benefit from using UTF-8 so >>> the don't have to decode and then re-encode strings coming in from the >>> network. In DabbleDB we used UTF-8 encoded string in the image, and just >>> ignored the fact that they were displayed incorrectly by inspectors. >>> Having a proper UTF-8 string class would be useful. >> >> We do the same thing, but that doesn't mean it's a good idea to create a >> new String-like class having its content encoded in UTF-8, because >> UTF-8-encoded strings can't be modified like regular strings. While it >> would be possible to implement all operations, such implementation would >> become the next SortedCollection (bad performance due to misuse). > > > This is not the case if you go for ropes > > https://github.com/KenDickey/Cuis-Smalltalk-Ropes Ropes are nice, but they are not strings. Levente > >> >> Levente >> >> [1] https://en.wikipedia.org/wiki/UTF-32 >> [2] https://en.wikipedia.org/wiki/Han_unification >> >>> >>> - Colin >>> >>> >>>> On Dec 4, 2015, at 6:46 AM, Levente Uzonyi <[hidden email]> wrote: >>>> >>>> Why would you want to have strings with UTF-8 or UTF-16 encoding in the >>>> image? >>>> What's wrong with the current UTF-32 representation? >>>> >>>> Levente >>>> >>>>> On Fri, 4 Dec 2015, EuanM wrote: >>>>> >>>>> I'm currently groping my way to seeing how feature-complete our >>>>> Unicode support is. I am doing this to establish what still needs to >>>>> be done to provide full Unicode support. >>>>> >>>>> This seems to me to be an area where it would be best to write it >>>>> once, and then have the same codebase incorporated into the Smalltalks >>>>> that most share a common ancestry. >>>>> >>>>> I am keen to get: equality-testing for strings; sortability for >>>>> strings which have ligatures and diacritic characters; and correct >>>>> round-tripping of data. >>>>> >>>>> Call to action: >>>>> ========== >>>>> >>>>> If you have comments on these proposals - such as "but we already have >>>>> that facility" or "the reason we do not have these facilities is >>>>> because they are dog-slow" - please let me know them. >>>>> >>>>> If you would like to help out, please let me know. >>>>> >>>>> If you have Unicode experience and expertise, and would like to be, or >>>>> would be willing to be, in the 'council of experts' for this project, >>>>> please let me know. >>>>> >>>>> If you have comments or ideas on anything mentioned in this email >>>>> >>>>> In the first instance, the initiative's website will be: >>>>> http://smalltalk.uk.to/unicode.html >>>>> >>>>> I have created a SqueakSource.com project called UnicodeSupport >>>>> >>>>> I want to avoid re-inventing any facilities which already exist. 
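To make the modification point above concrete: UTF-8 is variable-width, so
replacing one character can change the byte length. A quick check in an
image (assuming Zinc's utf8Encoded, which appears later in this thread):

  'e' utf8Encoded.    "=> #[101] - one byte"
  'é' utf8Encoded.    "=> #[195 169] - two bytes"
  "So an in-place at:put: on the encoded bytes cannot stay O(1):
   widening one character forces the rest of the buffer to shift."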
In reply to this post by EuanM
This is a long email. A *lot* of it is encapsulated in the Venn diagram at both:
http://smalltalk.uk.to/unicode-utf8.html
and my Smalltalk in Small Steps blog at:
http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html

My current thinking, and understanding.
==============================

0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
   b) UTF-8 can encode all of those characters in 1 byte, but may prefer
some of them to be encoded as sequences of multiple bytes. And it can
encode additional characters as sequences of multiple bytes.

1) Smalltalk has long had multiple String classes.

2) Any UTF-16 Unicode codepoint which has a codepoint of 00nn hex
is encoded as a UTF-8 codepoint of nn hex.

3) All valid ISO-8859-1 characters have a character code between 20 hex
and 7E hex, or between A0 hex and FF hex.
https://en.wikipedia.org/wiki/ISO/IEC_8859-1

4) All valid ASCII characters have a character code between 00 hex and 7E hex.
https://en.wikipedia.org/wiki/ASCII

5) a) All character codes which are defined within both ISO-8859-1 and
ASCII (i.e. character codes 20 hex to 7E hex) are defined identically
in both.
   b) All printable ASCII characters are defined identically in both
ASCII and ISO-8859-1.

6) All character codes defined in ASCII (00 hex to 7E hex) are defined
identically in Unicode UTF-8.

7) All character codes defined in ISO-8859-1 (20 hex - 7E hex; A0 hex -
FF hex) are defined identically in UTF-8.

8) => some Unicode codepoints map to both ASCII and ISO-8859-1.
   all ASCII maps 1:1 to Unicode UTF-8
   all ISO-8859-1 maps 1:1 to Unicode UTF-8

9) All ByteString elements which are either a valid ISO-8859-1 character
or a valid ASCII character are *also* a valid UTF-8 character.

10) ISO-8859-1 characters representing a character with a diacritic, or
a two-character ligature, have no ASCII equivalent. In Unicode UTF-8,
those character codes which represent compound glyphs are called
"compatibility codepoints".

11) The preferred Unicode representation of the characters which have
compatibility codepoints is as a short set of codepoints representing
the characters which are combined to form the glyph of the convenience
codepoint, i.e. as a sequence of bytes representing the component
characters.

12) Some concrete examples (the snippet after this list shows one way to
check values like these in an image):

A - aka Upper Case A
In ASCII, in ISO 8859-1
ASCII A - 41 hex
ISO-8859-1 A - 41 hex
UTF-8 A - 41 hex

BEL (a bell sound, often invoked by a Ctrl-g keyboard chord)
In ASCII, not in ISO 8859-1
ASCII : BEL - 07 hex
ISO-8859-1 : 07 hex is not a valid character code
UTF-8 : BEL - 07 hex

£ (GBP currency symbol)
In ISO-8859-1, not in ASCII
ASCII : A3 hex is not a valid ASCII code
UTF-8 : £ - A3 hex
ISO-8859-1 : £ - A3 hex

Upper Case C cedilla
In ISO-8859-1, not in ASCII, in UTF-8 as a compatibility codepoint
*and* a composed set of codepoints
ASCII : C7 hex is not a valid ASCII character code
ISO-8859-1 : Upper Case C cedilla - C7 hex
UTF-8 : Upper Case C cedilla (compatibility codepoint) - C7 hex
Unicode preferred Upper Case C cedilla (composed set of codepoints):
Upper case C - 0043 hex
followed by
cedilla - 00B8 hex

13) For any valid ASCII string *and* for any valid ISO-8859-1 string,
aByteString is completely adequate for editing and display.

14) When sorting any valid ASCII string *or* any valid ISO-8859-1
string, upper and lower case versions of the same character will be
treated differently.

15) When sorting any valid ISO-8859-1 string containing letter+diacritic
combination glyphs or ligature combination glyphs, the glyphs in
combination will be treated differently to a "plain" glyph of the
character, i.e. "C" and "C cedilla" will be treated very differently.
"ß" and "fs" will be treated very differently.

16) Different nations have different rules about where diacritic-ed
characters and ligature pairs should be placed when in alphabetical
order.

17) Some nations even have multiple standards - e.g. surnames beginning
either "M superscript-c" or "M superscript-a superscript-c" are treated
as beginning equivalently in UK phone directories, but not in other
situations.
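A quick way to check concrete values like those in (12) in a current
Pharo or Squeak image - a sketch, assuming Zinc's utf8Encoded String
extension, which is used elsewhere in this thread. Note that it answers
the encoded *bytes*, which for characters above 7F hex differ from the
bare codepoint:

  | show |
  show := [:char |
      Transcript
          show: char asString , ' codepoint: 16r' ,
              (char asInteger printString: 16) ,
              ' utf8 bytes: ' , char asString utf8Encoded printString;
          cr].
  show value: $A.    "codepoint: 16r41  utf8 bytes: #[65]"
  show value: $£.    "codepoint: 16rA3  utf8 bytes: #[194 163]"
  show value: $Ç.    "codepoint: 16rC7  utf8 bytes: #[195 135]"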
Some practical upshots
==================

1) Cuis and its ISO-8859-1 encoding is *exactly* the same as UTF-8, for
any single character it considers valid, or any ByteString it has made
up of characters it considers valid.

2) Any ByteString is valid UTF-8 in any of Squeak, Pharo, Cuis or any
other Smalltalk with a single byte ByteString following ASCII or
ISO-8859-1.

3) Any Smalltalk (or derivative language) using ByteString can
immediately consider its ByteString as valid UTF-8, as long as it also
considers the ByteString as valid ASCII and/or ISO-8859-1.

4) All of those can be successfully exported to any system using UTF-8
(e.g. HTML).

5) To successfully *accept* all UTF-8 we must be able to do either:
a) accept UTF-8 strings with composed characters
b) convert UTF-8 strings with composed characters into UTF-8 strings
that use *only* compatibility codepoints.

Class + protocol proposals
====================

a Utf8CompatibilityString class:

asByteString - ensures only compatibility codepoints are used. Ensures
it does not encode characters above 00FF hex.

asIso8859String - ensures only compatibility codepoints are used, and
that the characters are each valid ISO 8859-1.

asAsciiString - ensures only characters 00 hex - 7F hex are used.

asUtf8ComposedIso8859String - ensures all compatibility codepoints are
expanded into small OrderedCollections of codepoints.
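For concreteness, one possible shape for the first of those methods.
This is purely a sketch: the class and its protocol are the proposal
above, not existing code, and the error behaviour shown is just one
option.

  Utf8CompatibilityString >> asAsciiString
      "Sketch: answer a ByteString copy of the receiver, signalling an
       error when any codepoint falls outside 00 hex - 7F hex."
      ^ ByteString withAll: (self collect: [:each |
          each asInteger <= 16r7F
              ifTrue: [each]
              ifFalse: [self error: 'not representable in ASCII']])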
a Utf8ComposedIso8859String class - will provide sortable and comparable
UTF8 strings of all ASCII and ISO 8859-1 strings.

Then a Utf8SortableCollection class - a collection of
Utf8ComposedIso8859String words and phrases.

Custom sortBlocks will define the applicable sort order.

We can create a collection... a Dictionary, thinking about it, of
named, prefabricated sortBlocks.

This will work for all UTF8 strings of ISO-8859-1 and ASCII strings.
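A sketch of what that dictionary of prefabricated sortBlocks might look
like. The names follow the proposal; the German rule shown is
deliberately toy-sized (real DIN collation has more cases), and it
assumes asLowercase handles the characters involved:

  | sortBlocks words |
  sortBlocks := Dictionary new.
  sortBlocks
      at: #deOrder
      put: [:a :b |
          "Toy German-style rule: case-insensitive, a-umlaut sorts as a."
          (a asLowercase copyReplaceAll: 'ä' with: 'a')
              <= (b asLowercase copyReplaceAll: 'ä' with: 'a')].
  words := SortedCollection sortBlock: (sortBlocks at: #deOrder).
  words addAll: #('Äpfel' 'Zebra' 'Ahorn').
  words asArray    "=> #('Ahorn' 'Äpfel' 'Zebra')"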
If anyone has better names for the classes, please let me know.

If anyone else wants to help
- build these,
- create SUnit tests for these,
- write documentation for these,
please let me know.

n.b. I have had absolutely no experience of Ropes.

My own background with this stuff: in the early 90's I was a Project
Manager implementing office automation systems across a global company,
with offices in the Americas; Western, Eastern and Central European
nations (including Slavic and Cyrillic users); Japan; and China. The
mission-critical application was word-processing.

Our offices were spread around the globe, and we needed those offices
to successfully exchange documents with their sister offices, and with
the customers in each region the offices were in.

Unicode was then new, and our platform supplier was the NeXT
Corporation, who had been founder members of the Unicode Consortium in
1990.

So far: I've read the latest version of the Unicode Standard (v8.0).
This is freely downloadable. I've purchased a paper copy of an earlier
release. The typical change with each release is the addition of
further codespaces (i.e. alphabets, more or less), so old copies are
useful, as well as cheap - paper copies of version 4.0 are available
second-hand for < $10 / €10. (I'll be going through my V4.0 just to
make sure.)

Cheers,
Euan

On 5 December 2015 at 13:08, stepharo <[hidden email]> wrote:
> Hi EuanM
>
> On 4/12/15 12:42, EuanM wrote:
>>
>> I'm currently groping my way to seeing how feature-complete our
>> Unicode support is. I am doing this to establish what still needs to
>> be done to provide full Unicode support.
>
> this is great. Thanks for pushing this. I wrote and collected some
> roadmap (analyses on different topics) on the pharo github project -
> feel free to add this one there.
>
>> This seems to me to be an area where it would be best to write it
>> once, and then have the same codebase incorporated into the Smalltalks
>> that most share a common ancestry.
>>
>> I am keen to get: equality-testing for strings; sortability for
>> strings which have ligatures and diacritic characters; and correct
>> round-tripping of data.
>
> Go!
> My suggestion is
> start small
> make steady progress
> write tests
> commit often :)
>
> Stef
>
> What is the French phoneBook ordering? This is the first time I hear
> about it.
>
>> [the remainder of the original proposal, quoted in full above - snipped]
In reply to this post by EuanM
"Canonicalisation and sorting issues are hardly discussed.
In one place, the fact that a lot of special characters can have multiple representations is a big argument, while it is not mentioned how just treating things like a byte sequence would solve this (it doesn't AFAIU). Like how do you search for $e or $é if you know that it is possible to represent $é as just $é and as $e + $´ ?" This, for me, is one of the chief purposes of Unicode support. What you have it a convertor for "might contain compatibility codepoints" to "contains only composed sequences of codepoints, and no compatibility codepoints". As long as you're not using Strings where you should use Streams, it should be okay. And of course, for passing back to ISO Latin 1 or ASCII systems, you need to have a convertor to "contains only compatibility codepoints, and no composed sets of codepoints". As long as you can tell one type from the other, it's not a problem. Any string that mixes both can be converted in either direction by the same methods which I've just outlined. Once you have these, we can do this for all 1 byte characters. We can then expand this to have Classes and methods for character strings which contain the occasional character from other ISO character sets. Cheers, Euan On 6 December 2015 at 17:44, Sven Van Caekenberghe <[hidden email]> wrote: > >> On 05 Dec 2015, at 17:35, Todd Blanchard <[hidden email]> wrote: >> >> would suggest that the only worthwhile encoding is UTF8 - the rest are distractions except for being able to read and convert from other encodings to UTF8. UTF16 is a complete waste of time. >> >> Read http://utf8everywhere.org/ >> >> I have extensive Unicode chops from around 1999 to 2004 and my experience leads me to strongly agree with the views on that site. > > Well, I read the page/document/site as well. It was very interesting indeed, thanks for sharing it. > > In some sense it made me reconsider my aversion against in-image utf-8 encoding, maybe it could have some value. Absolute storage is more efficient, some processing might also be more efficient, i/o conversions to/from utf-8 become a no-op. What I found nice is the suggestion that most structured parsing (XML, JSON, CSV, STON, ...) could actually ignore the encoding for a large part and just assume its ASCII, which would/could be nice for performance. Also the fact that a lot of strings are (or should be) treated as opaque makes a lot of sense. > > What I did not like is that much of argumentation is based on issue in the Windows world, take all that away and the document shrinks in half. I would have liked a bit more fundamental CS arguments. > > Canonicalisation and sorting issues are hardly discussed. > > In one place, the fact that a lot of special characters can have multiple representations is a big argument, while it is not mentioned how just treating things like a byte sequence would solve this (it doesn't AFAIU). Like how do you search for $e or $é if you know that it is possible to represent $é as just $é and as $e + $´ ? > > Sven > >> Sent from the road >> >> On Dec 5, 2015, at 05:08, stepharo <[hidden email]> wrote: >> >>> Hi EuanM >>> >>> Le 4/12/15 12:42, EuanM a écrit : >>>> I'm currently groping my way to seeing how feature-complete our >>>> Unicode support is. I am doing this to establish what still needs to >>>> be done to provide full Unicode support. >>> >>> this is great. Thanks for pushing this. I wrote and collected some roadmap (analyses on different topics) >>> on the pharo github project feel free to add this one there. 
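A minimal sketch of one direction of such a convertor, using a toy
two-entry table. A real one would be generated from the Unicode
decomposition data; all names here are made up for illustration:

  | decompositions toComposed |
  decompositions := Dictionary new.
  decompositions at: 16r00E9 put: #(16r0065 16r0301).  "e acute -> e + combining acute"
  decompositions at: 16r00C7 put: #(16r0043 16r0327).  "C cedilla -> C + combining cedilla"
  toComposed := [:aString |
      String streamContents: [:out |
          aString do: [:each |
              (decompositions at: each asInteger ifAbsent: [nil])
                  ifNil: [out nextPut: each]
                  ifNotNil: [:points |
                      points do: [:p | out nextPut: p asCharacter]]]]].
  (toComposed value: 'Français') size    "=> 9 - the ç is now two codepoints"

The opposite direction ("compatibility codepoints only") would walk the
string with the inverted table, collapsing each known component pair
back into its single codepoint.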
On 6 December 2015 at 17:44, Sven Van Caekenberghe <[hidden email]> wrote:
>
>> On 05 Dec 2015, at 17:35, Todd Blanchard <[hidden email]> wrote:
>>
>> I would suggest that the only worthwhile encoding is UTF8 - the rest
>> are distractions except for being able to read and convert from other
>> encodings to UTF8. UTF16 is a complete waste of time.
>>
>> Read http://utf8everywhere.org/
>>
>> I have extensive Unicode chops from around 1999 to 2004 and my
>> experience leads me to strongly agree with the views on that site.
>
> Well, I read the page/document/site as well. It was very interesting
> indeed, thanks for sharing it.
>
> In some sense it made me reconsider my aversion against in-image utf-8
> encoding, maybe it could have some value. Absolute storage is more
> efficient, some processing might also be more efficient, i/o
> conversions to/from utf-8 become a no-op. What I found nice is the
> suggestion that most structured parsing (XML, JSON, CSV, STON, ...)
> could actually ignore the encoding for a large part and just assume
> it's ASCII, which would/could be nice for performance. Also the fact
> that a lot of strings are (or should be) treated as opaque makes a lot
> of sense.
>
> What I did not like is that much of the argumentation is based on
> issues in the Windows world; take all that away and the document
> shrinks in half. I would have liked a bit more fundamental CS
> arguments.
>
> Canonicalisation and sorting issues are hardly discussed.
>
> In one place, the fact that a lot of special characters can have
> multiple representations is a big argument, while it is not mentioned
> how just treating things like a byte sequence would solve this (it
> doesn't AFAIU). Like how do you search for $e or $é if you know that
> it is possible to represent $é as just $é and as $e + $´ ?
>
> Sven
>
>> Sent from the road
>>
>> On Dec 5, 2015, at 05:08, stepharo <[hidden email]> wrote:
>>
>>> [stepharo's reply and the original proposal, both quoted in full
>>> earlier in the thread - snipped]
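The reason parsers can "just assume ASCII" is a deliberate design
property of UTF-8: every byte of a multi-byte sequence has its high bit
set, so a byte equal to an ASCII delimiter really is that delimiter. A
quick illustration (again assuming Zinc's utf8Encoded):

  | bytes |
  bytes := 'a,é,б' utf8Encoded.
  "é and б each encode as two bytes >= 16r80, so counting the ASCII
   comma (16r2C) directly on the encoded bytes is still safe."
  bytes occurrencesOf: 16r2C    "=> 2"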
In reply to this post by EuanM
Verifying assumptions is the key reason why you should put documents like
this out for review.

Sven -

Cuis is encoded with ISO 8859-15 (aka ISO Latin 9). This is *NOT*, as
you state, ISO 99591 (and not, as I stated, 8859-1). We caught the
right specification bug for the wrong reason.

Juan: "Cuis: Chose not to use Squeak approach. Chose to make the base
image include and use only 1-byte strings. Chose to use ISO-8859-15"

I have double-checked - each character encoded in ISO Latin 15 (ISO
8859-15) is exactly the character represented by the corresponding
1-byte codepoint in Unicode 0000 to 00FF, with the following exceptions:

codepoint 20ac - Euro symbol
  character code a4 (replaces codepoint 00a4, generic currency symbol)
codepoint 0160 - Latin Upper Case S with Caron
  character code a6 (replaces codepoint 00a6, the Unix pipe character |)
codepoint 0161 - Latin Lower Case s with Caron
  character code a8 (replaces codepoint 00a8, diaeresis)
codepoint 017d - Latin Upper Case Z with Caron
  character code b4 (replaces codepoint 00b4, acute accent)
codepoint 017e - Latin Lower Case z with Caron
  character code b8 (replaces codepoint 00b8, cedilla)
codepoint 0152 - Upper Case OE ligature = Ethel
  character code bc (replaces codepoint 00bc, the 1/4 symbol)
codepoint 0153 - Lower Case oe ligature = ethel
  character code bd (replaces codepoint 00bd, the 1/2 symbol)
codepoint 0178 - Upper Case Y diaeresis
  character code be (replaces codepoint 00be, the 3/4 symbol)

Juan - I don't suppose we could persuade you to change to ISO Latin-1
from ISO Latin-9? It means we could run the same 1-byte string encoding
across Cuis, Squeak, Pharo and, as far as I can make out so far,
Dolphin Smalltalk and GNU Smalltalk.

The downside would be that French users would lose the ability to use
Y diaeresis, along with users of oe, OE, and s, S, z, Z with caron.
Along with the Euro. https://en.wikipedia.org/wiki/ISO/IEC_8859-15

I'm confident I understand the use of UTF-8 in principle.
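That difference table is small enough to carry around in code. A sketch
of it as a Dictionary mapping the eight affected byte values to their
ISO 8859-15 Unicode codepoints (a hypothetical helper, just mirroring
the list above):

  | latin9Differences |
  latin9Differences := Dictionary new.
  latin9Differences
      at: 16rA4 put: 16r20AC;    "euro sign"
      at: 16rA6 put: 16r0160;    "S with caron"
      at: 16rA8 put: 16r0161;    "s with caron"
      at: 16rB4 put: 16r017D;    "Z with caron"
      at: 16rB8 put: 16r017E;    "z with caron"
      at: 16rBC put: 16r0152;    "OE ligature"
      at: 16rBD put: 16r0153;    "oe ligature"
      at: 16rBE put: 16r0178.    "Y with diaeresis"
  "Decode one ISO 8859-15 byte to a Unicode codepoint:"
  (latin9Differences at: 16rA4 ifAbsent: [16rA4]) asCharacter    "=> $€"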
On 7 December 2015 at 08:27, Sven Van Caekenberghe <[hidden email]> wrote:
> I am sorry but one of your basic assumptions is completely wrong:
>
> 'Les élèves Français' encodeWith: #iso99591.
>
> => #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]
>
> 'Les élèves Français' utf8Encoded.
>
> => #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 97 105 115]
>
> ISO-9959-1 (~Latin1) is NOT AT ALL identical to UTF-8 in its upper,
> non-ASCII part !!
>
> Or shorter, $é is encoded in ISO-9959-1 as #[233], but as #[195 169]
> in UTF-8.
>
> So more than half the points you make, or the facts that you state,
> are thus plain wrong.
>
> The only thing that is correct is that the code points are equal, but
> that is not the same as the encoding !
>
> From this I am inclined to conclude that you do not fundamentally
> understand how UTF-8 works, which does not strike me as a good basis
> to design something called a UTF8String.
>
> Sorry.
>
> PS: Note also that Cuis' choice to use ISO-9959-1 only is pretty
> limiting in a Unicode world.
>
>> On 07 Dec 2015, at 04:21, EuanM <[hidden email]> wrote:
>>
>> [EuanM's "current thinking" email, quoted in full above - snipped]
In reply to this post by EuanM
As for the issue of lower case e acute, é:

It is compatibility codepoint 00e9 hex, and is therefore (as I
understand it) encodable in UTF-8 as compatibility codepoint e9 hex;
as the composed character #(0065 0301) (all in hex, 0301 being the
combining acute accent); as the same composed character as both
#(feff 0065 0301) and #(fffe 0065 0301) when endianness markers are
included; and it should also be legitimate to encode it in UTF-8 as a
composed character, #(65 301) (all hex); etc.
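For reference, what an actual encoder produces for these forms - a
Pharo workspace sketch (utf8Encoded is the Zinc convenience method,
and Character value: builds the combining acute, U+0301):

    $é asInteger.
        "-> 233, i.e. codepoint 16rE9"
    'é' utf8Encoded.
        "-> #[195 169] - codepoint E9 hex takes two bytes in UTF-8"
    ($e asString , (Character value: 16r0301) asString) utf8Encoded.
        "-> #[101 204 129] - 'e' plus the two UTF-8 bytes of U+0301"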

On 7 December 2015 at 08:27, Sven Van Caekenberghe <[hidden email]> wrote:
> I am sorry but one of your basic assumptions is completely wrong:
>
> 'Les élèves Français' encodeWith: #iso99591.
>
> => #[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]
>
> 'Les élèves Français' utf8Encoded.
>
> => #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 97 105 115]
>
> ISO-9959-1 (~Latin1) is NOT AT ALL identical to UTF-8 in its upper,
> non-ASCII part !!
>
> Or shorter, $é is encoded in ISO-9959-1 as #[233], but as #[195 169] in UTF-8.
>
> So more than half the points you make, or the facts that you state,
> are thus plain wrong.
>
> The only thing that is correct is that the code points are equal, but
> that is not the same as the encoding !
>
> From this I am inclined to conclude that you do not fundamentally
> understand how UTF-8 works, which does not strike me as a good basis
> to design something called a UTF8String.
>
> Sorry.
>
> PS: Note also that Cuis' choice to use ISO-9959-1 only is pretty
> limiting in a Unicode world.
>
>> On 07 Dec 2015, at 04:21, EuanM <[hidden email]> wrote:
>>
>> This is a long email. A *lot* of it is encapsulated in the Venn
>> diagram, both at:
>> http://smalltalk.uk.to/unicode-utf8.html
>> and on my Smalltalk in Small Steps blog at:
>> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
>>
>> My current thinking, and understanding.
>> ==============================
>>
>> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
>>    b) UTF-8 can encode all of those characters in 1 byte, but can
>>       prefer some of them to be encoded as sequences of multiple
>>       bytes. And can encode additional characters as sequences of
>>       multiple bytes.
>>
>> 1) Smalltalk has long had multiple String classes.
>>
>> 2) Any UTF-16 Unicode codepoint which has a codepoint of 00nn hex
>>    is encoded as a UTF-8 codepoint of nn hex.
>>
>> 3) All valid ISO-8859-1 characters have a character code between
>>    20 hex and 7E hex, or between A0 hex and FF hex.
>>    https://en.wikipedia.org/wiki/ISO/IEC_8859-1
>>
>> 4) All valid ASCII characters have a character code between 00 hex
>>    and 7E hex.
>>    https://en.wikipedia.org/wiki/ASCII
>>
>> 5) a) All character codes which are defined within both ISO-8859-1
>>       and ASCII (i.e. character codes 20 hex to 7E hex) are defined
>>       identically in both.
>>    b) All printable ASCII characters are defined identically in both
>>       ASCII and ISO-8859-1.
>>
>> 6) All character codes defined in ASCII (00 hex to 7E hex) are
>>    defined identically in Unicode UTF-8.
>>
>> 7) All character codes defined in ISO-8859-1 (20 hex - 7E hex; A0 hex
>>    - FF hex) are defined identically in UTF-8.
>>
>> 8) => some Unicode codepoints map to both ASCII and ISO-8859-1;
>>    all ASCII maps 1:1 to Unicode UTF-8;
>>    all ISO-8859-1 maps 1:1 to Unicode UTF-8.
>>
>> 9) All ByteString elements which are either a valid ISO-8859-1
>>    character or a valid ASCII character are *also* a valid UTF-8
>>    character.
>>
>> 10) ISO-8859-1 characters representing a character with a diacritic,
>>     or a two-character ligature, have no ASCII equivalent. In Unicode
>>     UTF-8, those character codes which represent compound glyphs are
>>     called "compatibility codepoints".
>>
>> 11) The preferred Unicode representation of the characters which have
>>     compatibility codepoints is a short sequence of codepoints
>>     representing the characters which are combined together to form
>>     the glyph of the convenience codepoint.
>>
>> 12) Some concrete examples:
>>
>>     A - aka Upper Case A
>>     In ASCII, in ISO 8859-1
>>       ASCII:      A - 41 hex
>>       ISO-8859-1: A - 41 hex
>>       UTF-8:      A - 41 hex
>>
>>     BEL (a bell sound, often invoked by a Ctrl-g keyboard chord)
>>     In ASCII, not in ISO 8859-1
>>       ASCII:      BEL - 07 hex
>>       ISO-8859-1: 07 hex is not a valid character code
>>       UTF-8:      BEL - 07 hex
>>
>>     £ (GBP currency symbol)
>>     In ISO-8859-1, not in ASCII
>>       ASCII:      A3 hex is not a valid ASCII code
>>       ISO-8859-1: £ - A3 hex
>>       UTF-8:      £ - A3 hex
>>
>>     Upper Case C cedilla
>>     In ISO-8859-1, not in ASCII; in UTF-8 as a compatibility
>>     codepoint *and* a composed set of codepoints
>>       ASCII:      C7 hex is not a valid ASCII character code
>>       ISO-8859-1: Upper Case C cedilla - C7 hex
>>       UTF-8:      Upper Case C cedilla (compatibility codepoint) - C7 hex
>>       Unicode preferred (composed set of codepoints):
>>                   Upper case C, 0043 hex, followed by cedilla, 00B8 hex
>>
>> 13) For any valid ASCII string *and* for any valid ISO-8859-1 string,
>>     aByteString is completely adequate for editing and display.
>>
>> 14) When sorting any valid ASCII string *or* any valid ISO-8859-1
>>     string, upper and lower case versions of the same character will
>>     be treated differently.
>>
>> 15) When sorting any valid ISO-8859-1 string containing
>>     letter+diacritic combination glyphs or ligature combination
>>     glyphs, the glyphs in combination will be treated differently to
>>     a "plain" glyph of the character, i.e. "C" and "C cedilla" will
>>     be treated very differently. "ß" and "fs" will be treated very
>>     differently.
>>
>> 16) Different nations have different rules about where diacritic-ed
>>     characters and ligature pairs should be placed when in
>>     alphabetical order.
>>
>> 17) Some nations even have multiple standards - e.g. surnames
>>     beginning either "M superscript-c" or "M superscript-a
>>     superscript-c" are treated as beginning equivalently in UK phone
>>     directories, but not in other situations.
>>
>> Some practical upshots
>> ==================
>>
>> 1) Cuis and its ISO-8859-1 encoding is *exactly* the same as UTF-8,
>>    for any single character it considers valid, or any ByteString it
>>    has made up of characters it considers valid.
>>
>> 2) Any ByteString is valid UTF-8 in any of Squeak, Pharo, Cuis or any
>>    other Smalltalk with a single-byte ByteString following ASCII or
>>    ISO-8859-1.
>>
>> 3) Any Smalltalk (or derivative language) using ByteString can
>>    immediately consider its ByteString as valid UTF-8, as long as it
>>    also considers the ByteString as valid ASCII and/or ISO-8859-1.
>>
>> 4) All of those can be successfully exported to any system using
>>    UTF-8 (e.g. HTML).
>>
>> 5) To successfully *accept* all UTF-8 we must be able to do either:
>>    a) accept UTF-8 strings with composed characters
>>    b) convert UTF-8 strings with composed characters into UTF-8
>>       strings that use *only* compatibility codepoints.
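Upshot 5b could start life as something like the following workspace
sketch. The two-entry mapping and the block name precompose are
illustrative assumptions only; a real implementation would need the
full Unicode composition data:

    | map precompose |
    map := Dictionary new.
    map at: #(16r0065 16r0301) put: 16r00E9.    "e + combining acute   -> é"
    map at: #(16r0043 16r0327) put: 16r00C7.    "C + combining cedilla -> Ç"
    precompose := [ :codePoints | | out i pair |
        out := OrderedCollection new.
        i := 1.
        [ i <= codePoints size ] whileTrue: [
            pair := i < codePoints size
                ifTrue: [ Array with: (codePoints at: i) with: (codePoints at: i + 1) ]
                ifFalse: [ nil ].
            (pair notNil and: [ map includesKey: pair ])
                ifTrue: [ out add: (map at: pair). i := i + 2 ]
                ifFalse: [ out add: (codePoints at: i). i := i + 1 ] ].
        out ].
    precompose value: #(16r0043 16r0327 16r0061).
        "-> an OrderedCollection of 16rC7 16r61, i.e. 'Ça' precomposed"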
>>
>> Class + protocol proposals
>> ==================
>>
>> a Utf8CompatibilityString class:
>>
>> asByteString - ensures only compatibility codepoints are used, and
>> that it does not encode characters above 00FF hex.
>>
>> asIso8859String - ensures only compatibility codepoints are used,
>> and that the characters are each valid ISO 8859-1.
>>
>> asAsciiString - ensures only characters 00 hex - 7F hex are used.
>>
>> asUtf8ComposedIso8859String - ensures all compatibility codepoints
>> are expanded into small OrderedCollections of codepoints.
>>
>> a Utf8ComposedIso8859String class - will provide sortable and
>> comparable UTF-8 strings of all ASCII and ISO 8859-1 strings.
>>
>> Then a Utf8SortableCollection class - a collection of
>> Utf8ComposedIso8859String words and phrases.
>>
>> Custom sortBlocks will define the applicable sort order.
>>
>> We can create a collection... a Dictionary, thinking about it, of
>> named, prefabricated sortBlocks.
>>
>> This will work for all UTF-8 strings of ISO-8859-1 and ASCII strings.
>>
>> If anyone has better names for the classes, please let me know.
>>
>> If anyone else wants to help
>> - build these,
>> - create SUnit tests for these,
>> - write documentation for these,
>> please let me know.
>>
>> n.b. I have had absolutely no experience of Ropes.
>>
>> My own background with this stuff: in the early 90's I was a Project
>> Manager implementing office automation systems across a global
>> company, with offices in the Americas; Western, Eastern and Central
>> Europe (including Slavic and Cyrillic users); Japan; and China. The
>> mission-critical application was word-processing.
>>
>> Our offices were spread around the globe, and we needed those offices
>> to successfully exchange documents with their sister offices, and
>> with the customers in each region the offices were in.
>>
>> Unicode was then new, and our platform supplier was the NeXT
>> Corporation, who had been a founder member of the Unicode Consortium
>> in 1990.
>>
>> So far: I've read the latest version of the Unicode Standard (v8.0).
>> This is freely downloadable.
>> I've purchased a paper copy of an earlier release. New releases
>> typically consist of additional codespaces (i.e. alphabets), so old
>> copies are useful, as well as cheap. (Paper copies of version 4.0 are
>> available second-hand for < $10 / €10.)
>>
>> The typical change with each release is the addition of further
>> codespaces (i.e. alphabets, more or less), so you don't lose a lot.
>> (I'll be going through my V4.0 just to make sure.)
>>
>> Cheers,
>> Euan
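As a provisional shape for the first of those classes - a sketch only,
where the choice of ByteArray as superclass and the empty variable
lists are placeholders rather than design decisions:

    ByteArray subclass: #Utf8CompatibilityString
        instanceVariableNames: ''
        classVariableNames: ''
        poolDictionaries: ''
        category: 'UnicodeSupport'

The conversions listed above (asByteString, asIso8859String,
asAsciiString, asUtf8ComposedIso8859String) would then be
instance-side protocol on it.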
>>
>> On 5 December 2015 at 13:08, stepharo <[hidden email]> wrote:
>>> Hi EuanM
>>>
>>> this is great. Thanks for pushing this. I wrote and collected some
>>> roadmap (analyses on different topics) on the pharo github project;
>>> feel free to add this one there.
>>>
>>> Go!
>>> My suggestion is:
>>> start small
>>> make steady progress
>>> write tests
>>> commit often :)
>>>
>>> Stef
>>>
>>> What is the french phoneBook ordering? Because this is the first
>>> time I hear about it.
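The Dictionary-of-named-sortBlocks idea, as a workspace sketch. The
fold block below is a deliberately crude stand-in for real collation
rules (proper orderings, including the French phone-book one stepharo
asks about, need real locale tailoring data), and all the names are
provisional:

    | fold deOrder sortBlocks words |
    fold := [ :aString | | map |
        map := Dictionary new.
        map at: $ä put: $a; at: $ö put: $o; at: $ü put: $u.
        aString asLowercase collect: [ :c | map at: c ifAbsent: [ c ] ] ].
    deOrder := [ :a :b | (fold value: a) <= (fold value: b) ].
    sortBlocks := Dictionary new.
    sortBlocks at: #deOrder put: deOrder.
    words := SortedCollection sortBlock: (sortBlocks at: #deOrder).
    words addAll: #('Zucker' 'Äpfel' 'Birne').
    words asArray.
        "-> #('Äpfel' 'Birne' 'Zucker') - Ä sorts as A, not after Z"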
In reply to this post by EuanM
Hi Sven, okay - I'm plodding my way through
https://tools.ietf.org/html/rfc3629 and
https://en.wikipedia.org/wiki/UTF-8#Examples to see what's what.
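The RFC 3629 bit layout for the 1- and 2-byte cases can be written out
directly as a workspace sketch (illustration only - real code should
use the image's own encoder, e.g. utf8Encoded):

    | encode |
    encode := [ :codePoint |
        codePoint < 16r80
            ifTrue: [ ByteArray with: codePoint ]    "0xxxxxxx"
            ifFalse: [ codePoint < 16r800
                ifTrue: [ ByteArray
                    with: 16rC0 + (codePoint bitShift: -6)    "110yyyyy"
                    with: 16r80 + (codePoint bitAnd: 16r3F) ]    "10xxxxxx"
                ifFalse: [ Error signal: '3- and 4-byte cases omitted from this sketch' ] ] ].
    encode value: 16r41.    "-> #[65] - plain A is unchanged"
    encode value: 16rE9.    "-> #[195 169] - é needs two bytes in UTF-8"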

On 7 December 2015 at 11:01, Sven Van Caekenberghe <[hidden email]> wrote:
>
>> On 07 Dec 2015, at 11:51, EuanM <[hidden email]> wrote:
>>
>> Verifying assumptions is the key reason why you should send documents
>> like this out for review.
>
> Fair enough, discussion can only help.
>
>> Sven -
>>
>> Cuis is encoded with ISO 8859-15 (aka ISO Latin 9).
>>
>> Sven, this is *NOT*, as you state, ISO 99591 (and not, as I stated, 8859-1).
>
> Ah, that was a typo, I meant, of course (and sorry for the confusion):
>
> 'Les élèves Français' encodeWith: #iso88591.
>
> "#[76 101 115 32 233 108 232 118 101 115 32 70 114 97 110 231 97 105 115]"
>
> 'Les élèves Français' utf8Encoded.
>
> "#[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 97 105 115]"
>
> Or shorter, $é is encoded in ISO-8859-1 as #[233], but as #[195 169] in UTF-8.
>
> That Cuis chose ISO-8859-15 makes no real difference.
>
> The thing is: you started talking about UTF-8 encoded strings in the
> image, and then the difference between code point and encoding is
> really important.
>
> Only in ASCII is the encoding identical, not for anything else.
>
>> We caught the right specification bug for the wrong reason.
>>
>> Juan: "Cuis: Chose not to use Squeak approach. Chose to make the base
>> image include and use only 1-byte strings. Chose to use ISO-8859-15."
>>
>> I have double-checked - each character encoded in ISO Latin 9 (ISO
>> 8859-15) is exactly the character represented by the corresponding
>> 1-byte codepoint in Unicode 0000 to 00FF, with the following
>> exceptions:
>>
>> codepoint 20ac - Euro symbol
>>   character code a4 (replaces codepoint 00a4, the generic currency symbol)
>>
>> codepoint 0160 - Latin Upper Case S with Caron
>>   character code a6 (replaces codepoint 00a6, the | Unix pipe character)
>>
>> codepoint 0161 - Latin Lower Case s with Caron
>>   character code a8 (replaces codepoint 00a8, the diaeresis)
>>
>> codepoint 017d - Latin Upper Case Z with Caron
>>   character code b4 (replaces codepoint 00b4, the acute accent)
>>
>> codepoint 017e - Latin Lower Case z with Caron
>>   character code b8 (replaces codepoint 00b8, the cedilla)
>>
>> codepoint 0152 - Upper Case OE ligature = Ethel
>>   character code bc (replaces codepoint 00bc, the 1/4 symbol)
>>
>> codepoint 0153 - Lower Case oe ligature = ethel
>>   character code bd (replaces codepoint 00bd, the 1/2 symbol)
>>
>> codepoint 0178 - Upper Case Y diaeresis
>>   character code be (replaces codepoint 00be, the 3/4 symbol)
>>
>> Juan - I don't suppose we could persuade you to change to ISO Latin-1
>> from ISO Latin-9?
>>
>> It means we could run the same 1-byte string encoding across Cuis,
>> Squeak, Pharo, and, as far as I can make out so far, Dolphin
>> Smalltalk and GNU Smalltalk.
>>
>> The downside would be that French Y diaeresis would lose the ability
>> to use that character, along with users of oe, OE, and s, S, z, Z
>> with caron. Along with the Euro.
>>
>> https://en.wikipedia.org/wiki/ISO/IEC_8859-15
>>
>> I'm confident I understand the use of UTF-8 in principle.
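Those eight exceptions, as a sketch of a byte-to-codepoint conversion
table (the names latin9Exceptions and latin9ToCodePoint are made up
for illustration):

    | latin9Exceptions latin9ToCodePoint |
    latin9Exceptions := Dictionary new.
    latin9Exceptions
        at: 16rA4 put: 16r20AC;    "Euro sign"
        at: 16rA6 put: 16r0160;    "S with caron"
        at: 16rA8 put: 16r0161;    "s with caron"
        at: 16rB4 put: 16r017D;    "Z with caron"
        at: 16rB8 put: 16r017E;    "z with caron"
        at: 16rBC put: 16r0152;    "OE ligature"
        at: 16rBD put: 16r0153;    "oe ligature"
        at: 16rBE put: 16r0178.    "Y with diaeresis"
    latin9ToCodePoint := [ :byte | latin9Exceptions at: byte ifAbsent: [ byte ] ].
    (latin9ToCodePoint value: 16rA4) printString: 16.    "-> '20AC', the Euro sign"
    (latin9ToCodePoint value: 16rE9) printString: 16.    "-> 'E9' - é is unchanged"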
>>> >>> => #[76 101 115 32 195 169 108 195 168 118 101 115 32 70 114 97 110 195 167 97 105 115] >>> >>> ISO-9959-1 (~Latin1) is NOT AT ALL identical to UTF-8 in its upper, non-ACII part !! >>> >>> Or shorter, $é is encoded in ISO-9959-1 as #[233], but as #[195 169] in UTF-8. >>> >>> So more than half the points you make, or the facts that you state, are thus plain wrong. >>> >>> The only thing that is correct is that the code points are equal, but that is not the same as the encoding ! >>> >>> From this I am inclined to conclude that you do not fundamentally understand how UTF-8 works, which does not strike me as good basis to design something called a UTF8String. >>> >>> Sorry. >>> >>> PS: Note also that Cuis' choice to use ISO-9959-1 only is pretty limiting in a Unicode world. >>> >>>> On 07 Dec 2015, at 04:21, EuanM <[hidden email]> wrote: >>>> >>>> This a long email. A *lot* of it is encapsulated in the Venn diagram both: >>>> http://smalltalk.uk.to/unicode-utf8.html >>>> and my Smalltalk in Small Steps blog at: >>>> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html >>>> >>>> My current thinking, and understanding. >>>> ============================== >>>> >>>> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte. >>>> b) UTF-8 can encode all of those characters in 1 byte, but can >>>> prefer some of them to be encoded as sequences of multiple bytes. And >>>> can encode additional characters as sequences of multiple bytes. >>>> >>>> 1) Smalltalk has long had multiple String classes. >>>> >>>> 2) Any UTF16 Unicode codepoint which has a codepoint of 00nn hex >>>> is encoded as a UTF-8 codepoint of nn hex. >>>> >>>> 3) All valid ISO-8859-1 characters have a character code between 20 >>>> hex and 7E hex, or between A0 hex and FF hex. >>>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1 >>>> >>>> 4) All valid ASCII characters have a character code between 00 hex and 7E hex. >>>> https://en.wikipedia.org/wiki/ASCII >>>> >>>> >>>> 5) a) All character codes which are defined within ISO-8859-1 and also >>>> defined within ASCII. (i.e. character codes 20 hex to 7E hex) are >>>> defined identically in both. >>>> >>>> b) All printable ASCII characters are defined identically in both >>>> ASCII and ISO-8859-1 >>>> >>>> 6) All character codes defined in ASCII (00 hex to 7E hex) are >>>> defined identically in Unicode UTF-8. >>>> >>>> 7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex >>>> - FF hex ) are defined identically in UTF-8. >>>> >>>> 8) => some Unicode codepoints map to both ASCII and ISO-8859-1. >>>> all ASCII maps 1:1 to Unicode UTF-8 >>>> all ISO-8859-1 maps 1:1 to Unicode UTF-8 >>>> >>>> 9) All ByteStrings elements which are either a valid ISO-8859-1 >>>> character or a valid ASCII character are *also* a valid UTF-8 >>>> character. >>>> >>>> 10) ISO-8859-1 characters representing a character with a diacritic, >>>> or a two-character ligature, have no ASCII equivalent. In Unicode >>>> UTF-8, those character codes which are representing compound glyphs, >>>> are called "compatibility codepoints". >>>> >>>> 11) The preferred Unicode representation of the characters which have >>>> compatibility codepoints is as a a short set of codepoints >>>> representing the characters which are combined together to form the >>>> glyph of the convenience codepoint, as a sequence of bytes >>>> representing the component characters. 
>>>> >>>> >>>> 12) Some concrete examples: >>>> >>>> A - aka Upper Case A >>>> In ASCII, in ISO 8859-1 >>>> ASCII A - 41 hex >>>> ISO-8859-1 A - 41 hex >>>> UTF-8 A - 41 hex >>>> >>>> BEL (a bell sound, often invoked by a Ctrl-g keyboard chord) >>>> In ASCII, not in ISO 8859-1 >>>> ASCII : BEL - 07 hex >>>> ISO-8859-1 : 07 hex is not a valid character code >>>> UTF-8 : BEL - 07 hex >>>> >>>> £ (GBP currency symbol) >>>> In ISO-8859-1, not in ASCII >>>> ASCII : A3 hex is not a valid ASCII code >>>> UTF-8: £ - A3 hex >>>> ISO-8859-1: £ - A3 hex >>>> >>>> Upper Case C cedilla >>>> In ISO-8859-1, not in ASCII, in UTF-8 as a compatibility codepoint >>>> *and* a composed set of codepoints >>>> ASCII : C7 hex is not a valid ASCII character code >>>> ISO-8859-1 : Upper Case C cedilla - C7 hex >>>> UTF-8 : Upper Case C cedilla (compatibility codepoint) - C7 hex >>>> Unicode preferred Upper Case C cedilla (composed set of codepoints) >>>> Upper case C 0043 hex (Upper case C) >>>> followed by >>>> cedilla 00B8 hex (cedilla) >>>> >>>> 13) For any valid ASCII string *and* for any valid ISO-8859-1 string, >>>> aByteString is completely adequate for editing and display. >>>> >>>> 14) When sorting any valid ASCII string *or* any valid ISO-8859-1 >>>> string, upper and lower case versions of the same character will be >>>> treated differently. >>>> >>>> 15) When sorting any valid ISO-8859-1 string containing >>>> letter+diacritic combination glyphs or ligature combination glyphs, >>>> the glyphs in combination will treated differently to a "plain" glyph >>>> of the character >>>> i.e. "C" and "C cedilla" will be treated very differently. "ß" and >>>> "fs" will be treated very differently. >>>> >>>> 16) Different nations have different rules about where diacritic-ed >>>> characted and ligature pairs should be placed when in alphabetical >>>> order. >>>> >>>> 17) Some nations even have multiple standards - e.g. surnames >>>> beginning either "M superscript-c" or "M superscript-a superscript-c" >>>> are treated as beginning equivalently in UK phone directories, but not >>>> in other situations. >>>> >>>> >>>> Some practical upshots >>>> ================== >>>> >>>> 1) Cuis and its ISO-8859-1 encoding is *exactly* the same as UTF-8, >>>> for any single character it considers valid, or any ByteString it has >>>> made up of characters it considers valid. >>>> >>>> 2) Any ByteString is valid UTF-8 in any of Squeak, Pharo, Cuis or any >>>> other Smalltalk with a single byte ByteString following ASCII or >>>> ISO-8859-1. >>>> >>>> 3) Any Smalltalk (or derivative language) using ByteString can >>>> immediately consider it's ByteString as valid UTF-8, as long as it >>>> also considers the ByteSring as valid ASCII and/or ISO-8859-1. >>>> >>>> 4) All of those can be successfully exported to any system using UTF-8 >>>> (e.g. HTML). >>>> >>>> 5) To successfully *accept* all UTF-8 we much be able to do either: >>>> a) accept UTF-8 strings with composed characters >>>> b) convert UTF-8 strings with composed characters into UTF-8 strings >>>> that use *only* compatibility codepoints. >>>> >>>> >>>> Class + protocol proposals >>>> >>>> >>>> >>>> a Utf8CompatibilityString class. >>>> >>>> asByteString - ensure only compatibility codepoints are used. >>>> Ensure it doews not encode characters above 00FF hex. >>>> >>>> asIso8859String - ensures only compatibility codepoints are used, >>>> and that the characters are each valid ISO 8859-1 >>>> >>>> asAsciiString - ensures only characters 00hex - 7F hex are used. 
>>>> >>>> asUtf8ComposedIso8859String - ensures all compatibility codepoints >>>> are expanded into small OrderedCollections of codepoints >>>> >>>> a Utf8ComposedIso8859String class - will provide sortable and >>>> comparable UTF8 strings of all ASCII and ISO 8859-1 strings. >>>> >>>> Then a Utf8SortableCollection class - a collection of >>>> Utf8ComposedIso8859Strings words and phrases. >>>> >>>> Custom sortBlocks will define the applicable sort order. >>>> >>>> We can create a collection... a Dictionary, thinking about it, of >>>> named, prefabricated sortBlocks. >>>> >>>> This will work for all UTF8 strings of ISO-8859-1 and ASCII strings. >>>> >>>> If anyone has better names for the classes, please let me know. >>>> >>>> If anyone else wants to help >>>> - build these, >>>> - create SUnit tests for these >>>> - write documentation for these >>>> Please let me know. >>>> >>>> n.b. I have had absolutely no experience of Ropes. >>>> >>>> My own background with this stuff: In the early 90's as a Project >>>> Manager implementing office automation systems across a global >>>> company, with offices in the Americas, Western, Eastern and Central >>>> Europe, (including Slavic and Cyrillic users) nations, Japan and >>>> China. The mission-critical application was word-processing. >>>> >>>> Our offices were spread around the globe, and we needed those offices >>>> to successfully exchange documents with their sister offices, and with >>>> the customers in each region the offices were in. >>>> >>>> Unicode was then new, and our platform supplier was the NeXT >>>> Corporation, who had been founder members in of the Unicode Consortium >>>> in 1990. >>>> >>>> So far: I've read the latest version of the Unicode Standard (v8.0). >>>> This is freely downloadable. >>>> I've purchased a paper copy of an earlier release. New releases >>>> typically consist additional codespaces (i.e. alphabets). So old >>>> copies are useful, as well as cheap. (Paper copies of version 4.0 >>>> are available second-hand for < $10 / €10). >>>> >>>> The typical change with each release is the addition of further >>>> codespaces (i.e alphabets (more or less) ), so you don't lose a lot. >>>> (I'll be going through my V4.0 just to make sure) >>>> >>>> Cheers, >>>> Euan >>>> >>>> >>>> >>>> >>>> On 5 December 2015 at 13:08, stepharo <[hidden email]> wrote: >>>>> Hi EuanM >>>>> >>>>> Le 4/12/15 12:42, EuanM a écrit : >>>>>> >>>>>> I'm currently groping my way to seeing how feature-complete our >>>>>> Unicode support is. I am doing this to establish what still needs to >>>>>> be done to provide full Unicode support. >>>>> >>>>> >>>>> this is great. Thanks for pushing this. I wrote and collected some roadmap >>>>> (analyses on different topics) >>>>> on the pharo github project feel free to add this one there. >>>>>> >>>>>> >>>>>> This seems to me to be an area where it would be best to write it >>>>>> once, and then have the same codebase incorporated into the Smalltalks >>>>>> that most share a common ancestry. >>>>>> >>>>>> I am keen to get: equality-testing for strings; sortability for >>>>>> strings which have ligatures and diacritic characters; and correct >>>>>> round-tripping of data. >>>>> >>>>> Go! >>>>> My suggestion is >>>>> start small >>>>> make steady progress >>>>> write tests >>>>> commit often :) >>>>> >>>>> Stef >>>>> >>>>> What is the french phoneBook ordering because this is the first time I hear >>>>> about it. 
>>>>> >>>>>> >>>>>> Call to action: >>>>>> ========== >>>>>> >>>>>> If you have comments on these proposals - such as "but we already have >>>>>> that facility" or "the reason we do not have these facilities is >>>>>> because they are dog-slow" - please let me know them. >>>>>> >>>>>> If you would like to help out, please let me know. >>>>>> >>>>>> If you have Unicode experience and expertise, and would like to be, or >>>>>> would be willing to be, in the 'council of experts' for this project, >>>>>> please let me know. >>>>>> >>>>>> If you have comments or ideas on anything mentioned in this email >>>>>> >>>>>> In the first instance, the initiative's website will be: >>>>>> http://smalltalk.uk.to/unicode.html >>>>>> >>>>>> I have created a SqueakSource.com project called UnicodeSupport >>>>>> >>>>>> I want to avoid re-inventing any facilities which already exist. >>>>>> Except where they prevent us reaching the goals of: >>>>>> - sortable UTF8 strings >>>>>> - sortable UTF16 strings >>>>>> - equivalence testing of 2 UTF8 strings >>>>>> - equivalence testing of 2 UTF16 strings >>>>>> - round-tripping UTF8 strings through Smalltalk >>>>>> - roundtripping UTF16 strings through Smalltalk. >>>>>> As I understand it, we have limited Unicode support atm. >>>>>> >>>>>> Current state of play >>>>>> =============== >>>>>> ByteString gets converted to WideString when need is automagically >>>>>> detected. >>>>>> >>>>>> Is there anything else that currently exists? >>>>>> >>>>>> Definition of Terms >>>>>> ============== >>>>>> A quick definition of terms before I go any further: >>>>>> >>>>>> Standard terms from the Unicode standard >>>>>> =============================== >>>>>> a compatibility character : an additional encoding of a *normal* >>>>>> character, for compatibility and round-trip conversion purposes. For >>>>>> instance, a 1-byte encoding of a Latin character with a diacritic. >>>>>> >>>>>> Made-up terms >>>>>> ============ >>>>>> a convenience codepoint : a single codepoint which represents an item >>>>>> that is also encoded as a string of codepoints. >>>>>> >>>>>> (I tend to use the terms compatibility character and compatibility >>>>>> codepoint interchangably. The standard only refers to them as >>>>>> compatibility characters. However, the standard is determined to >>>>>> emphasise that characters are abstract and that codepoints are >>>>>> concrete. So I think it is often more useful and productive to think >>>>>> of compatibility or convenience codepoints). >>>>>> >>>>>> a composed character : a character made up of several codepoints >>>>>> >>>>>> Unicode encoding explained >>>>>> ===================== >>>>>> A convenience codepoint can therefore be thought of as a code point >>>>>> used for a character which also has a composed form. >>>>>> >>>>>> The way Unicode works is that sometimes you can encode a character in >>>>>> one byte, sometimes not. Sometimes you can encode it in two bytes, >>>>>> sometimes not. >>>>>> >>>>>> You can therefore have a long stream of ASCII which is single-byte >>>>>> Unicode. If there is an occasional Cyrillic or Greek character in the >>>>>> stream, it would be represented either by a compatibility character or >>>>>> by a multi-byte combination. >>>>>> >>>>>> Using compatibility characters can prevent proper sorting and >>>>>> equivalence testing. >>>>>> >>>>>> Using "pure" Unicode, ie. "normal encodings", can cause compatibility >>>>>> and round-tripping probelms. 
Although avoiding them can *also* cause >>>>>> compatibility issues and round-tripping problems. >>>>>> >>>>>> Currently my thinking is: >>>>>> >>>>>> a Utf8String class >>>>>> an Ordered collection, with 1 byte characters as the modal element, >>>>>> but short arrays of wider strings where necessary >>>>>> a Utf16String class >>>>>> an Ordered collection, with 2 byte characters as the modal element, >>>>>> but short arrays of wider strings >>>>>> beginning with a 2-byte endianness indicator. >>>>>> >>>>>> Utf8Strings sometimes need to be sortable, and sometimes need to be >>>>>> compatible. >>>>>> >>>>>> So my thinking is that Utf8String will contain convenience codepoints, >>>>>> for round-tripping. And where there are multiple convenience >>>>>> codepoints for a character, that it standardises on one. >>>>>> >>>>>> And that there is a Utf8SortableString which uses *only* normal >>>>>> characters. >>>>>> >>>>>> We then need methods to convert between the two. >>>>>> >>>>>> aUtf8String asUtf8SortableString >>>>>> >>>>>> and >>>>>> >>>>>> aUtf8SortableString asUtf8String >>>>>> >>>>>> >>>>>> Sort orders are culture and context dependent - Sweden and Germany >>>>>> have different sort orders for the same diacritic-ed characters. Some >>>>>> countries have one order in general usage, and another for specific >>>>>> usages, such as phone directories (e.g. UK and France) >>>>>> >>>>>> Similarly for Utf16 : Utf16String and Utf16SortableString and >>>>>> conversion methods >>>>>> >>>>>> A list of sorted words would be a SortedCollection, and there could be >>>>>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder, >>>>>> seOrder, ukOrder, etc >>>>>> >>>>>> along the lines of >>>>>> aListOfWords := SortedCollection sortBlock: deOrder >>>>>> >>>>>> If a word is either a Utf8SortableString, or a well-formed Utf8String, >>>>>> then we can perform equivalence testing on them trivially. >>>>>> >>>>>> To make sure a Utf8String is well formed, we would need to have a way >>>>>> of cleaning up any convenience codepoints which were valid, but which >>>>>> were for a character which has multiple equally-valid alternative >>>>>> convenience codepoints, and for which the string currently had the >>>>>> "wrong" convenience codepoint. (i.e for any character with valid >>>>>> alternative convenience codepoints, we would choose one to be in the >>>>>> well-formed Utf8String, and we would need a method for cleaning the >>>>>> alternative convenience codepoints out of the string, and replacing >>>>>> them with the chosen approved convenience codepoint. >>>>>> >>>>>> aUtf8String cleanUtf8String >>>>>> >>>>>> With WideString, a lot of the issues disappear - except >>>>>> round-tripping(although I'm sure I have seen something recently about >>>>>> 4-byte strings that also have an additional bit. Which would make >>>>>> some Unicode characters 5-bytes long.) >>>>>> >>>>>> >>>>>> (I'm starting to zone out now - if I've overlooked anything - obvious, >>>>>> subtle, or somewhere in between, please let me know) >>>>>> >>>>>> Cheers, >>>>>> Euan >>>>>> >>>>>> >>>>> >>>>> >>> >>>> On 07 Dec 2015, at 04:21, EuanM <[hidden email]> wrote: >>>> >>>> This a long email. A *lot* of it is encapsulated in the Venn diagram both: >>>> http://smalltalk.uk.to/unicode-utf8.html >>>> and my Smalltalk in Small Steps blog at: >>>> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html >>>> >>>> My current thinking, and understanding. 
>>>> ============================== >>>> >>>> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte. >>>> b) UTF-8 can encode all of those characters in 1 byte, but can >>>> prefer some of them to be encoded as sequences of multiple bytes. And >>>> can encode additional characters as sequences of multiple bytes. >>>> >>>> 1) Smalltalk has long had multiple String classes. >>>> >>>> 2) Any UTF16 Unicode codepoint which has a codepoint of 00nn hex >>>> is encoded as a UTF-8 codepoint of nn hex. >>>> >>>> 3) All valid ISO-8859-1 characters have a character code between 20 >>>> hex and 7E hex, or between A0 hex and FF hex. >>>> https://en.wikipedia.org/wiki/ISO/IEC_8859-1 >>>> >>>> 4) All valid ASCII characters have a character code between 00 hex and 7E hex. >>>> https://en.wikipedia.org/wiki/ASCII >>>> >>>> >>>> 5) a) All character codes which are defined within ISO-8859-1 and also >>>> defined within ASCII. (i.e. character codes 20 hex to 7E hex) are >>>> defined identically in both. >>>> >>>> b) All printable ASCII characters are defined identically in both >>>> ASCII and ISO-8859-1 >>>> >>>> 6) All character codes defined in ASCII (00 hex to 7E hex) are >>>> defined identically in Unicode UTF-8. >>>> >>>> 7) All character codes defined in ISO-8859-1 (20 hex - 7E hex ; A0 hex >>>> - FF hex ) are defined identically in UTF-8. >>>> >>>> 8) => some Unicode codepoints map to both ASCII and ISO-8859-1. >>>> all ASCII maps 1:1 to Unicode UTF-8 >>>> all ISO-8859-1 maps 1:1 to Unicode UTF-8 >>>> >>>> 9) All ByteStrings elements which are either a valid ISO-8859-1 >>>> character or a valid ASCII character are *also* a valid UTF-8 >>>> character. >>>> >>>> 10) ISO-8859-1 characters representing a character with a diacritic, >>>> or a two-character ligature, have no ASCII equivalent. In Unicode >>>> UTF-8, those character codes which are representing compound glyphs, >>>> are called "compatibility codepoints". >>>> >>>> 11) The preferred Unicode representation of the characters which have >>>> compatibility codepoints is as a a short set of codepoints >>>> representing the characters which are combined together to form the >>>> glyph of the convenience codepoint, as a sequence of bytes >>>> representing the component characters. >>>> >>>> >>>> 12) Some concrete examples: >>>> >>>> A - aka Upper Case A >>>> In ASCII, in ISO 8859-1 >>>> ASCII A - 41 hex >>>> ISO-8859-1 A - 41 hex >>>> UTF-8 A - 41 hex >>>> >>>> BEL (a bell sound, often invoked by a Ctrl-g keyboard chord) >>>> In ASCII, not in ISO 8859-1 >>>> ASCII : BEL - 07 hex >>>> ISO-8859-1 : 07 hex is not a valid character code >>>> UTF-8 : BEL - 07 hex >>>> >>>> £ (GBP currency symbol) >>>> In ISO-8859-1, not in ASCII >>>> ASCII : A3 hex is not a valid ASCII code >>>> UTF-8: £ - A3 hex >>>> ISO-8859-1: £ - A3 hex >>>> >>>> Upper Case C cedilla >>>> In ISO-8859-1, not in ASCII, in UTF-8 as a compatibility codepoint >>>> *and* a composed set of codepoints >>>> ASCII : C7 hex is not a valid ASCII character code >>>> ISO-8859-1 : Upper Case C cedilla - C7 hex >>>> UTF-8 : Upper Case C cedilla (compatibility codepoint) - C7 hex >>>> Unicode preferred Upper Case C cedilla (composed set of codepoints) >>>> Upper case C 0043 hex (Upper case C) >>>> followed by >>>> cedilla 00B8 hex (cedilla) >>>> >>>> 13) For any valid ASCII string *and* for any valid ISO-8859-1 string, >>>> aByteString is completely adequate for editing and display. 
>>>>
>>>> 14) When sorting any valid ASCII string *or* any valid ISO-8859-1
>>>> string, upper and lower case versions of the same character will be
>>>> treated differently.
>>>>
>>>> 15) When sorting any valid ISO-8859-1 string containing
>>>> letter+diacritic combination glyphs or ligature combination glyphs,
>>>> the glyphs in combination will be treated differently to a "plain"
>>>> glyph of the character,
>>>> i.e. "C" and "C cedilla" will be treated very differently. "ß" and
>>>> "fs" will be treated very differently.
>>>>
>>>> 16) Different nations have different rules about where diacritic-ed
>>>> characters and ligature pairs should be placed when in alphabetical
>>>> order.
>>>>
>>>> 17) Some nations even have multiple standards - e.g. surnames
>>>> beginning either "M superscript-c" or "M superscript-a superscript-c"
>>>> are treated as beginning equivalently in UK phone directories, but
>>>> not in other situations.
>>>>
>>>> Some practical upshots
>>>> ==================
>>>>
>>>> 1) Cuis and its ISO-8859-1 encoding is *exactly* the same as UTF-8,
>>>> for any single character it considers valid, or any ByteString it has
>>>> made up of characters it considers valid.
>>>>
>>>> 2) Any ByteString is valid UTF-8 in any of Squeak, Pharo, Cuis or any
>>>> other Smalltalk with a single byte ByteString following ASCII or
>>>> ISO-8859-1.
>>>>
>>>> 3) Any Smalltalk (or derivative language) using ByteString can
>>>> immediately consider its ByteString as valid UTF-8, as long as it
>>>> also considers the ByteString as valid ASCII and/or ISO-8859-1.
>>>>
>>>> 4) All of those can be successfully exported to any system using
>>>> UTF-8 (e.g. HTML).
>>>>
>>>> 5) To successfully *accept* all UTF-8 we must be able to do either:
>>>> a) accept UTF-8 strings with composed characters
>>>> b) convert UTF-8 strings with composed characters into UTF-8 strings
>>>> that use *only* compatibility codepoints.
>>>>
>>>> Class + protocol proposals
>>>> ====================
>>>>
>>>> a Utf8CompatibilityString class.
>>>>
>>>> asByteString - ensures only compatibility codepoints are used.
>>>> Ensures it does not encode characters above 00FF hex.
>>>>
>>>> asIso8859String - ensures only compatibility codepoints are used,
>>>> and that the characters are each valid ISO 8859-1
>>>>
>>>> asAsciiString - ensures only characters 00 hex - 7F hex are used.
>>>>
>>>> asUtf8ComposedIso8859String - ensures all compatibility codepoints
>>>> are expanded into small OrderedCollections of codepoints
>>>>
>>>> a Utf8ComposedIso8859String class - will provide sortable and
>>>> comparable UTF8 strings of all ASCII and ISO 8859-1 strings.
>>>>
>>>> Then a Utf8SortableCollection class - a collection of
>>>> Utf8ComposedIso8859String words and phrases.
>>>>
>>>> Custom sortBlocks will define the applicable sort order.
>>>>
>>>> We can create a collection... a Dictionary, thinking about it, of
>>>> named, prefabricated sortBlocks.
>>>>
>>>> This will work for all UTF8 strings of ISO-8859-1 and ASCII strings.
>>>>
>>>> If anyone has better names for the classes, please let me know.
>>>>
>>>> If anyone else wants to help
>>>> - build these,
>>>> - create SUnit tests for these,
>>>> - write documentation for these,
>>>> please let me know.
>>>>
>>>> n.b. I have had absolutely no experience of Ropes.
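For concreteness, one possible shape for the first proposed class, in
ordinary Squeak/Pharo subclassing syntax. This is a hedged sketch only: the
'bytes' instance variable, the category name and the method body are
illustrative assumptions, not a settled design.

    Object subclass: #Utf8CompatibilityString
        instanceVariableNames: 'bytes'
        classVariableNames: ''
        category: 'UnicodeSupport'

    asAsciiString
        "Answer a ByteString copy of my bytes, signalling an error if any
         byte falls outside the ASCII range 00 hex - 7F hex."
        bytes do: [ :byte |
            byte > 16r7F ifTrue: [ ^ self error: 'not a valid ASCII string' ] ].
        ^ bytes asString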
>>>>
>>>> My own background with this stuff: in the early 90's I was a Project
>>>> Manager implementing office automation systems across a global
>>>> company, with offices in the Americas, in Western, Eastern and
>>>> Central European nations (including Slavic and Cyrillic users), and
>>>> in Japan and China. The mission-critical application was
>>>> word-processing.
>>>>
>>>> Our offices were spread around the globe, and we needed those offices
>>>> to successfully exchange documents with their sister offices, and
>>>> with the customers in each region the offices were in.
>>>>
>>>> Unicode was then new, and our platform supplier was the NeXT
>>>> Corporation, who had been a founder member of the Unicode Consortium
>>>> in 1990.
>>>>
>>>> So far: I've read the latest version of the Unicode Standard (v8.0).
>>>> This is freely downloadable.
>>>> I've purchased a paper copy of an earlier release. New releases
>>>> typically consist of additional codespaces (i.e. alphabets), so old
>>>> copies are useful, as well as cheap. (Paper copies of version 4.0
>>>> are available second-hand for < $10 / €10.)
>>>>
>>>> The typical change with each release is the addition of further
>>>> codespaces (i.e. alphabets, more or less), so you don't lose a lot.
>>>> (I'll be going through my v4.0 just to make sure.)
>>>>
>>>> Cheers,
>>>> Euan
>>>>
>>>> On 5 December 2015 at 13:08, stepharo <[hidden email]> wrote:
>>>>> Hi EuanM
>>>>>
>>>>> On 4/12/15 12:42, EuanM wrote:
>>>>>> I'm currently groping my way to seeing how feature-complete our
>>>>>> Unicode support is. I am doing this to establish what still needs
>>>>>> to be done to provide full Unicode support.
>>>>>
>>>>> this is great. Thanks for pushing this. I wrote and collected some
>>>>> roadmap (analyses on different topics) on the Pharo GitHub project;
>>>>> feel free to add this one there.
>>>>>
>>>>>> This seems to me to be an area where it would be best to write it
>>>>>> once, and then have the same codebase incorporated into the
>>>>>> Smalltalks that most share a common ancestry.
>>>>>>
>>>>>> I am keen to get: equality-testing for strings; sortability for
>>>>>> strings which have ligatures and diacritic characters; and correct
>>>>>> round-tripping of data.
>>>>>
>>>>> Go!
>>>>> My suggestion is
>>>>> start small
>>>>> make steady progress
>>>>> write tests
>>>>> commit often :)
>>>>>
>>>>> Stef
>>>>>
>>>>> What is the French phoneBook ordering? This is the first time I
>>>>> have heard of it.
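As a half-answer to the phone-book question, here is a toy sketch of the
named-sortBlock dictionary proposed above. The collation rules are
deliberately simplistic placeholders; real French phone-book order, for
instance, compares accents right-to-left across the word, which needs
proper collation keys rather than one-character tweaks.

    | sortBlocks words |
    sortBlocks := Dictionary new.
    sortBlocks at: #ukOrder put: [ :a :b | a asLowercase <= b asLowercase ].
    sortBlocks at: #deOrder put: [ :a :b |
        "Toy rule: treat a-umlaut as plain a, as in German dictionary order."
        (a asLowercase copyReplaceAll: 'ä' with: 'a')
            <= (b asLowercase copyReplaceAll: 'ä' with: 'a') ].
    words := SortedCollection sortBlock: (sortBlocks at: #deOrder).
    words addAll: #('Zebra' 'Äpfel' 'Apfel').
    words asArray    "both spellings of Apfel now sort together, ahead of Zebra"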
Hi all,
First of all, I'm sorry for leaving Squeak m17n work incomplete. Things
are degrading bit by bit, and many things are not working as well as
before, unfortunately. That said, there are a few things I'd like to
mention:

On Sun, Dec 6, 2015 at 7:21 PM, EuanM <[hidden email]> wrote:
> This is a long email. A *lot* of it is encapsulated in the Venn
> diagram at:
> http://smalltalk.uk.to/unicode-utf8.html
> and in my Smalltalk in Small Steps blog at:
> http://smalltalkinsmallsteps.blogspot.co.uk/2015/12/utf-8-for-cuis-pharo-and-squeak.html
>
> My current thinking, and understanding.
> ==============================
>
> 0) a) ASCII and ISO-8859-1 consist of characters encoded in 1 byte.
>    b) UTF-8 can encode all of those characters in 1 byte, but can
> prefer some of them to be encoded as sequences of multiple bytes. And
> can encode additional characters as sequences of multiple bytes.
>
> 1) Smalltalk has long had multiple String classes.

Yes, but it was never meant to be user visible, in the same sense that
a typical user does not (always) have to worry about the difference
between SmallInteger and LargeInteger.

> 2) Any UTF16 Unicode codepoint which has a codepoint of 00nn hex
> is encoded as a UTF-8 codepoint of nn hex.

Modulo endianness, but yes.

> 7) All character codes defined in ISO-8859-1 (20 hex - 7E hex; A0 hex
> - FF hex) are defined identically in UTF-8.

3) to 6) are more or less correct, but this 7) is not right, if you
mean what I think you mean.

> 8) => some Unicode codepoints map to both ASCII and ISO-8859-1.
>    all ASCII maps 1:1 to Unicode UTF-8
>    all ISO-8859-1 maps 1:1 to Unicode UTF-8

So this is not correct, for the same reason.

> 9) All ByteString elements which are either a valid ISO-8859-1
> character or a valid ASCII character are *also* a valid UTF-8
> character.

No. ByteStrings are meant to be ISO-8859-1. Unfortunately, Squeak does
use ByteString to store UTF-8 (my intention was only transiently; in
hindsight, it would have been a better convention to use ByteArray for
this transient UTF-8 data.)

> 11) The preferred Unicode representation of the characters which have
> compatibility codepoints is as a short set of codepoints representing
> the characters which are combined together to form the glyph of the
> convenience codepoint, as a sequence of bytes representing the
> component characters.
>
> 12) Some concrete examples:
>
> £ (GBP currency symbol)
> In ISO-8859-1, not in ASCII
> ASCII: A3 hex is not a valid ASCII code
> UTF-8: £ - A3 hex

This is 0xC2 0xA3, not A3.

> Upper Case C cedilla
> In ISO-8859-1, not in ASCII, in UTF-8 as a compatibility codepoint
> *and* a composed set of codepoints
> ASCII: C7 hex is not a valid ASCII character code
> ISO-8859-1: Upper Case C cedilla - C7 hex
> UTF-8: Upper Case C cedilla (compatibility codepoint) - C7 hex

No, and,

> Unicode preferred Upper Case C cedilla (composed set of codepoints):
> Upper case C - 0043 hex
> followed by
> cedilla - 00B8 hex

no. The codepoint that follows is U+0327, or 0xCC 0xA7 in UTF-8.

> 13) For any valid ASCII string *and* for any valid ISO-8859-1 string,
> aByteString is completely adequate for editing and display.

So unfortunately this is not true.

> 14) When sorting any valid ASCII string *or* any valid ISO-8859-1
> string, upper and lower case versions of the same character will be
> treated differently.
>
> 15) When sorting any valid ISO-8859-1 string containing
> letter+diacritic combination glyphs or ligature combination glyphs,
> the glyphs in combination will be treated differently to a "plain"
> glyph of the character,
> i.e. "C" and "C cedilla" will be treated very differently. "ß" and
> "fs" will be treated very differently.

The statement is true, but perhaps you mean ss instead of fs?

> a Utf8CompatibilityString class.
>
> asByteString - ensures only compatibility codepoints are used.
> Ensures it does not encode characters above 00FF hex.
>
> asIso8859String - ensures only compatibility codepoints are used,
> and that the characters are each valid ISO 8859-1
>
> asAsciiString - ensures only characters 00 hex - 7F hex are used.
>
> asUtf8ComposedIso8859String - ensures all compatibility codepoints
> are expanded into small OrderedCollections of codepoints
>
> a Utf8ComposedIso8859String class - will provide sortable and
> comparable UTF8 strings of all ASCII and ISO 8859-1 strings.
>
> Then a Utf8SortableCollection class - a collection of
> Utf8ComposedIso8859String words and phrases.
>
> Custom sortBlocks will define the applicable sort order.
>
> We can create a collection... a Dictionary, thinking about it, of
> named, prefabricated sortBlocks.
>
> This will work for all UTF8 strings of ISO-8859-1 and ASCII strings.
>
> If anyone has better names for the classes, please let me know.
>
> If anyone else wants to help
> - build these,
> - create SUnit tests for these,
> - write documentation for these,
> please let me know.

My feeling is that these extra classes are totally overkill and not
necessary. Unfortunately, I have not been following the discussion very
closely, but what is the problem that is being solved here?

--
-- Yoshiki
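Yoshiki's byte values are easy to verify. A hedged sketch, not an existing
system method, covering just the one- and two-byte UTF-8 cases (code points
up to 7FF hex):

    | utf8BytesOf |
    utf8BytesOf := [ :codePoint |
        codePoint < 16r80
            ifTrue: [ ByteArray with: codePoint ]
            ifFalse: [ ByteArray
                with: 16rC0 + (codePoint bitShift: -6)
                with: 16r80 + (codePoint bitAnd: 16r3F) ] ].
    (utf8BytesOf value: 16rA3) = #[16rC2 16rA3].    "true - £ is C2 A3 in UTF-8, not bare A3"
    (utf8BytesOf value: 16r327) = #[16rCC 16rA7].   "true - COMBINING CEDILLA, U+0327"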
Hi Henry,
To be honest, at some point I'm going to long for the much more
succinct semantics of healthcare systems and sports scoring and
administration systems again. :-)

Codepoints are any of *either*:
- the representation of a component of an abstract character, e.g.
  "A" #(0041) as a component of a composed character; *or*
- the sole representation of the whole of an abstract character; *or*
- a representation of an abstract character provided for backwards
  compatibility, which is more properly represented by a series of
  codepoints representing a composed character.

e.g. "A" #(0041) as a codepoint can be:

the sole representation of the whole of the abstract character "A"
#(0041);

the representation of a component of the composed (i.e. preferred)
version of the abstract character Å #(0041 030a).

Å #(00C5) represents one valid compatibility form of the abstract
character Å, which is most properly represented by #(0041 030a).

Å #(212B) also represents one valid compatibility form of the abstract
character Å, which is most properly represented by #(0041 030a).

With any luck, this satisfies both our semantic understandings of the
concept of "codepoint". Would you agree with that?

In Unicode, codepoints are *NOT* an abstract numerical representation
of a text character. At least not as we generally understand the term
"text character" from our experience of non-Unicode character mappings.

Codepoints represent "*encoded characters*", and "a *text element* ...
is represented by a sequence of one or more codepoints". (The term
"text element" is deliberately left undefined in the Unicode standard.)

Individual codepoints are very often *not* the encoded form of an
abstract character that we are interested in - unless we are
communicating to or from another system (which in some cases is the
Smalltalk ByteString class).

In other words: *some* individual codepoints *may* be a representation
of a specific *abstract character*, but only in special cases. The
general case in Unicode is that Unicode defines a representation (or
representations) of a Unicode *abstract character*.

The Unicode standard representation of an abstract character is a
composed sequence of codepoints, where in some cases that sequence is
as short as 1 codepoint. In other cases, Unicode has a compatibility
alias of a single codepoint which is *also* a representation of an
abstract character. There are some cases where an abstract character
can be represented by more than one single-codepoint compatibility
codepoint.

Cheers,
Euan

On 7 December 2015 at 11:11, Henrik Johansen <[hidden email]> wrote:
>
>> On 07 Dec 2015, at 11:51 , EuanM <[hidden email]> wrote:
>>
>> And indeed, in principle.
>>
>> On 7 December 2015 at 10:51, EuanM <[hidden email]> wrote:
>>> Verifying assumptions is the key reason why you should send
>>> documents like this out for review.
>>>
>>> Sven -
>>>
>>> I'm confident I understand the use of UTF-8 in principal.
>
> I can only second Sven's sentiment that you need to better
> differentiate code points (an abstract numerical representation of a
> character, where a set of such mappings define a charset, such as
> Unicode) and character encoding forms (which are how code points are
> represented in bytes by a defined process such as UTF-8, UTF-16 etc).
>
> I know you'll probably think I'm arguing semantics again, but these
> are *important* semantics ;)
>
> Cheers,
> Henry
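The Å example above can be made concrete with a tiny canonical-decomposition
sketch. The two-entry table is a hand-rolled stand-in for the real data in
the Unicode Character Database (UnicodeData.txt); everything here is
illustrative, not an existing API.

    | decompositions toNfd |
    decompositions := Dictionary new.
    decompositions at: 16r00C5 put: #(16r0041 16r030A).   "Å, LATIN CAPITAL LETTER A WITH RING ABOVE"
    decompositions at: 16r212B put: #(16r0041 16r030A).   "Å, ANGSTROM SIGN"
    toNfd := [ :codePoints |
        Array streamContents: [ :out |
            codePoints do: [ :cp |
                out nextPutAll: (decompositions at: cp ifAbsent: [ { cp } ]) ] ] ].
    (toNfd value: #(16r00C5)) = (toNfd value: #(16r0041 16r030A)).   "true"
    (toNfd value: #(16r212B)) = (toNfd value: #(16r00C5)).           "true - all three forms name the same abstract character"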
No. A codepoint is the numerical value assigned to a character. An "encoded character" is the way a codepoint is represented in bytes using a given encoding.
I agree you have a good grasp of the distinction between an abstract
character (characters and character sequences which should be treated
as equivalent wrt equality / sorting / display, etc.) and a character
(which each have a code point assigned). That is beside the point both
Sven and I tried to get through, which is the difference between a code
point and the encoded form(s) of said code point.

When you write:
"and therefore encodable in UTF-8 as compatibility codepoint e9 hex and
as the composed character #(0065 00b4) (all in hex) and as the same
composed character as both #(feff 0065 00b4) and #(ffef 0065 00b4) when
endianness markers are included"
it's quite clear you confuse the two.

0xFEFF is the codepoint of the character used as a BOM. When you state
that it can be written ffef (I assume you meant FFFE), you are again
confusing the code point and its encoded value (an encoded value which
only occurs in UTF-16/32, no less).

When this distinction is clear, it might be easier to see the value in
keeping Strings as arrays of Unicode code points, converted to and from
encoded forms only when entering/exiting the system.

Cheers,
Henry
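The BOM makes Henrik's distinction vivid: one code point, several encoded
forms. The byte values below are the standard ones; the variable names are
purely illustrative.

    | bomCodePoint utf8Form utf16beForm utf16leForm |
    bomCodePoint := 16rFEFF.             "the code point: just an Integer"
    utf8Form := #[16rEF 16rBB 16rBF].    "U+FEFF encoded in UTF-8"
    utf16beForm := #[16rFE 16rFF].       "U+FEFF encoded in UTF-16, big-endian"
    utf16leForm := #[16rFF 16rFE].       "the same code point in UTF-16, little-endian:
                                          FF FE is a byte order, not a different character"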