Unicode Support


Unicode Support

EuanM
I'm currently groping my way to seeing how feature-complete our
Unicode support is.  I am doing this to establish what still needs to
be done to provide full Unicode support.

This seems to me to be an area where it would be best to write it
once, and then have the same codebase incorporated into the Smalltalks
that most share a common ancestry.

I am keen to get: equality-testing for strings; sortability for
strings which have ligatures and diacritic characters; and correct
round-tripping of data.

Call to action:
==========

If you have comments on these proposals - such as "but we already have
that facility" or "the reason we do not have these facilities is
because they are dog-slow" - please let me know them.

If you would like to help out, please let me know.

If you have Unicode experience and expertise, and would like to be, or
would be willing to be, in the  'council of experts' for this project,
please let me know.

If you have comments or ideas on anything mentioned in this email,
please let me know.

In the first instance, the initiative's website will be:
http://smalltalk.uk.to/unicode.html

I have created a SqueakSource.com project called UnicodeSupport

I want to avoid re-inventing any facilities which already exist.
Except where they prevent us reaching the goals of:
  - sortable UTF8 strings
  - sortable UTF16 strings
  - equivalence testing of 2 UTF8 strings
  - equivalence testing of 2 UTF16 strings
  - round-tripping UTF8 strings through Smalltalk
  - roundtripping UTF16 strings through Smalltalk.
As I understand it, we have limited Unicode support atm.

Current state of play
===============
ByteString gets converted to WideString when the need is automagically detected.

Is there anything else that currently exists?

Definition of Terms
==============
A quick definition of terms before I go any further:

Standard terms from the Unicode standard
===============================
a compatibility character : an additional encoding of a *normal*
character, for compatibility and round-trip conversion purposes.  For
instance, a 1-byte encoding of a Latin character with a diacritic.

Made-up terms
============
a convenience codepoint :  a single codepoint which represents an item
that is also encoded as a string of codepoints.

(I tend to use the terms compatibility character and compatibility
codepoint interchangeably.  The standard only refers to them as
compatibility characters.  However, the standard is determined to
emphasise that characters are abstract and that codepoints are
concrete.  So I think it is often more useful and productive to think
of compatibility or convenience codepoints.)

a composed character :  a character made up of several codepoints

Unicode encoding explained
=====================
A convenience codepoint can therefore be thought of as a code point
used for a character which also has a composed form.
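
A concrete example, written as codepoint sequences (16r hex literals):

#(16r00E9)           "é as a single convenience codepoint:
                      LATIN SMALL LETTER E WITH ACUTE"
#(16r0065 16r0301)   "é as a composed sequence: e + COMBINING ACUTE ACCENT"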

The way Unicode works is that sometimes you can encode a character in
one byte, sometimes not.  Sometimes you can encode it in two bytes,
sometimes not.

You can therefore have a long stream of ASCII which is single-byte
Unicode.  If there is an occasional Cyrillic or Greek character in the
stream, it would be represented either by a compatibility character or
by a multi-byte combination.
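
As a concrete sketch of those size rules in UTF-8 (hand-rolled, not an
existing method):

utf8ByteCountFor: codePoint
    "Answer how many bytes UTF-8 needs to encode codePoint."
    codePoint < 16r80 ifTrue: [^ 1].      "ASCII"
    codePoint < 16r800 ifTrue: [^ 2].     "e.g. Greek, Cyrillic"
    codePoint < 16r10000 ifTrue: [^ 3].   "rest of the Basic Multilingual Plane"
    ^ 4                                   "supplementary planes"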

Using compatibility characters can prevent proper sorting and
equivalence testing.

Using "pure" Unicode, ie. "normal encodings", can cause compatibility
and round-tripping probelms.  Although avoiding them can *also* cause
compatibility issues and round-tripping problems.

Currently my thinking is:

a Utf8String class
an OrderedCollection, with 1-byte characters as the modal element,
but short arrays of wider characters where necessary

a Utf16String class
an OrderedCollection, with 2-byte characters as the modal element,
but short arrays of wider characters,
beginning with a 2-byte endianness indicator (the natural candidate
being the byte order mark U+FEFF: bytes FE FF if big-endian, FF FE if
little-endian).

Utf8Strings sometimes need to be sortable, and sometimes need to be compatible.

So my thinking is that Utf8String will contain convenience codepoints,
for round-tripping.  And where there are multiple convenience
codepoints for a character, that it standardises on one.

And that there is a Utf8SortableString which uses *only* normal characters.

We then need methods to convert between the two.

aUtf8String asUtf8SortableString

and

aUtf8SortableString asUtf8String
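
As a sketch of the first conversion (codePointsDo:, DecompositionMap
and fromCodePoints: are all hypothetical - none of this exists yet):

asUtf8SortableString
    "Sketch: expand every convenience codepoint into its multi-codepoint
    form, so equivalent strings end up with identical contents."
    | codePoints |
    codePoints := OrderedCollection new.
    self codePointsDo: [:cp |
        codePoints addAll:
            (DecompositionMap at: cp ifAbsent: [Array with: cp])].
    ^ Utf8SortableString fromCodePoints: codePoints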


Sort orders are culture- and context-dependent - Sweden and Germany
have different sort orders for the same diacritic-ed characters.  Some
countries have one order in general usage, and another for specific
usages, such as phone directories (e.g. the UK and France).

Similarly for Utf16 :  Utf16String and Utf16SortableString and
conversion methods

A list of sorted words would be a SortedCollection, and there could be
pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
seOrder, ukOrder, etc.

along the lines of
aListOfWords := SortedCollection sortBlock: deOrder
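
A hedged sketch of what deOrder itself might look like
(collationKeyFor: is hypothetical - it would answer a culture-specific
sort key, presumably built from a collation table such as the DUCET
plus locale tailoring):

deOrder := [:a :b |
    (a collationKeyFor: #de) <= (b collationKeyFor: #de)].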

If a word is either a Utf8SortableString, or a well-formed Utf8String,
then we can perform equivalence testing on them trivially.

To make sure a Utf8String is well formed, we would need to have a way
of cleaning up any convenience codepoints which were valid, but which
were for a character which has multiple equally-valid alternative
convenience codepoints, and for which the string currently had the
"wrong" convenience codepoint.  (I.e. for any character with valid
alternative convenience codepoints, we would choose one to be in the
well-formed Utf8String, and we would need a method for cleaning the
alternative convenience codepoints out of the string and replacing
them with the chosen, approved convenience codepoint.)

aUtf8String cleanUtf8String
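
Sketched with the same hypothetical protocol as the earlier conversion,
plus a hypothetical PreferredForm table mapping each alternative
convenience codepoint to the approved one:

cleanUtf8String
    "Sketch: answer a copy with every alternative convenience codepoint
    replaced by the chosen, approved one."
    | codePoints |
    codePoints := OrderedCollection new.
    self codePointsDo: [:cp |
        codePoints add: (PreferredForm at: cp ifAbsent: [cp])].
    ^ Utf8String fromCodePoints: codePoints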

With WideString, a lot of the issues disappear - except
round-tripping.  (Although I'm sure I have seen something recently
about 4-byte strings that also have an additional bit, which would
make some Unicode characters 5 bytes long.)


(I'm starting to zone out now - if I've overlooked anything - obvious,
subtle, or somewhere in between, please let me know)

Cheers,
    Euan


Re: Unicode Support

Juan Vuletich-4
Hi Euan,

On 12/4/2015 8:42 AM, EuanM wrote:

> I'm currently groping my way to seeing how feature-complete our
> Unicode support is.  I am doing this to establish what still needs to
> be done to provide full Unicode support.
>
> This seems to me to be an area where it would be best to write it
> once, and then have the same codebase incorporated into the Smalltalks
> that most share a common ancestry.
>
> I am keen to get: equality-testing for strings; sortability for
> strings which have ligatures and diacritic characters; and correct
> round-tripping of data.
>

Current state (as I understand it):

- Squeak: M17N by Yoshiki

- Pharo: Inherited Squeak code. I don't know how much it has diverged.

- Cuis: Chose not to use the Squeak approach. Chose to make the base image
include and use only 1-byte strings. Chose to use ISO-8859-15, the most
complete standard for the Latin alphabet, including diacritics, etc.
Includes limited support for Unicode. See
https://github.com/Cuis-Smalltalk/Cuis-Smalltalk-Dev/blob/master/Documentation/UnicodeNotes.md


Unicode is a complex area, and no system can claim to do it all right. I
see several possible levels of Unicode support. These are (with
Smalltalk examples):

0) No Unicode support at all. Like ST-80 or Squeak previous to 3.7.

1) Limited Unicode support (as in Cuis). Can handle Unicode characters in
Strings. Comfortable display / editing of text is restricted to the Latin
alphabet. Non-ISO-8859-15 characters are represented as NCRs, but are
not instances of Character themselves. For example, an NCR such as
'&#945;' (made of 6 8-bit Characters) represents the Greek letter alpha,
and is properly handled if such a string is converted to a UTF-8
ByteArray (for example for copying into the Clipboard or for serving Web
pages; there is a small parsing sketch after this list). In short, you
cannot directly edit or display general Unicode, but you can embed it in
code, include it in Strings, copy&paste, and serve web pages with it.

2) Ken's Cuis-Smalltalk-Unicode. Can display and edit. Includes the
great Ropes representation for Strings. Limited font support.

3) Squeak. Can display and edit Unicode strings. Includes broad font
support with TrueType / OpenType. Does not do grapheme composition,
right to left, or ligatures.

4) Scratch. Most complete support, including all that's missing in the
previous approaches. Uses Pango. This means that text composition is not
in Smalltalk, but in a large and complex external library. This sounds
appropriate: full Unicode is so complex that it took Pango many years to
do it reasonably well, and most projects rely on it.
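
(Back to the NCRs in level 1 - a tiny sketch, hand-rolled rather than
an existing Cuis method, of recovering the code point from the 8-bit
NCR string:)

| ncr codePoint |
ncr := '&#945;'.
codePoint := (ncr copyFrom: 3 to: ncr size - 1) asNumber.
codePoint = 16r3B1  "true: GREEK SMALL LETTER ALPHA"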

I believe that your objectives can be met both with Squeak's approach
and with Cuis' approach. Cuis, for instance, is only missing methods for
normalization (for characters with multiple code point representations)
and sorting, all done on UTF-8 ByteArrays.
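
Why normalization matters is easy to see at the byte level - the same
abstract character Å as two different, unequal UTF-8 ByteArrays:

#[195 133] = #[65 204 138]
    "false: C3 85 (U+00C5, precomposed) vs 41 CC 8A (U+0041 U+030A)"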

Cheers,
Juan Vuletich

References:
NCR: http://en.wikipedia.org/wiki/Numeric_character_reference
Ken's: https://github.com/KenDickey/Cuis-Smalltalk-Unicode


Re: Unicode Support

KenDickey
In reply to this post by EuanM
On Fri, 4 Dec 2015 11:42:11 +0000
EuanM <[hidden email]> wrote:

> I'm currently groping my way to seeing how feature-complete our
> Unicode support is.  I am doing this to establish what still needs to
> be done to provide full Unicode support.

Moby work!

I did some prototyping to get an idea of the scale of work, using a simple Unicode strikefont:

  https://github.com/KenDickey/Cuis-Smalltalk-Unicode

Useful to be able to do simple, very limited, editing of web pages.

The prototype is notable in using immutable strings (ropes) to allow mixing characters of different sizes.  E.g. adding a 32-bit code point to an ASCII string does not convert the string from narrow to wide characters.  The theory is that much editing adds a small amount of text to a relatively large document, so immutable strings can save a lot of space.
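
The core idea, sketched with illustrative names (this is not the
prototype's actual protocol - see the repository for that):

, aRope
    "Sketch: concatenation answers a node that simply points at both
    pieces, so an all-ASCII piece stays narrow even when aRope
    contains 32-bit code points."
    ^ RopeConcatenation left: self right: aRope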

My impression from doing this basic bit is that doing a decent job with ligatures, multi-directional text, and layouts adds a significant amount of code and data and I would guess doubles the complexity of the system.  I.e. _good_ Unicode support is on the order of the Smalltalk VM and current core class set in terms of size and complexity.

Commercially, this is done over time by fairly well funded teams of people.

===

So it is probably useful to talk a bit about what you want to accomplish and the relative costs.

One decision is the level of support.  
  [A] Is Unicode for the image itself (replace ASCII)?
  [B] Is Unicode to edit and maintain Web Pages (page editor)?
  [C] Is Unicode to support general fonts for page layout (LibreOffice/Word/..)?

I like Juan's list:

> 0) No Unicode support at all. Like ST-80 or Squeak previous to 3.7.

+ Cheap (done!)
- Non-existent (less useful)

> 1) Limited Unicode support (as in Cuis). Can handle Unicode characters in
Strings. Comfortable display / editing of text is restricted to the Latin
alphabet. Non-ISO-8859-15 characters are represented as NCRs, but are
not instances of Character themselves. For example, an NCR such as
'&#945;' (made of 6 8-bit Characters) represents the Greek letter alpha,
and is properly handled if such a string is converted to a UTF-8
ByteArray (for example for copying into the Clipboard or for serving Web
pages). In short, you cannot directly edit or display general Unicode,
but you can embed it in code, include it in Strings, copy&paste, and
serve web pages with it.

+ Fairly Inexpensive
- No unifonts for editors [but could add (4)!]

> 2) Ken's Cuis-Smalltalk-Unicode. Can display and edit. Includes the
great Ropes representation for Strings. Limited font support.

+ Moderately Expensive (Complex, tables take a large amount of space)
- Probably slower in Smalltalk than external library (4)
- Strikefonts do not do ligatures


> 3) Squeak. Can display and edit Unicode strings. Includes broad font
support with TrueType / OpenType. Does not do grapheme composition,
right to left, or ligatures.

+ Does a good basic job for text display
- Code base could use some cleanup
- Still needs bi-directional text, sorting, and editing support

I have not tried to use this to edit large documents; perhaps the Pharo/Seaside communities have experience here.


> 4) Scratch. Most complete support, including all that's missing in the
previous approaches. Uses Pango. This means that text composition is not
in Smalltalk, but in a large and complex external library. This sounds
appropriate: full Unicode is so complex that it took Pango many years to
do it reasonably well, and most projects rely on it.

+ Reuses decades of design and implementation experience.
- Relies on external libraries

Personally, I think this is the least cost for the benefit.

==

Note that there are more compact encodings of glyph layouts than fonts.  See "character description language".


$0.02,
-KenD


Fwd: [Pharo-dev] [squeak-dev] Unicode Support

Hannes Hirzel
In reply to this post by EuanM
---------- Forwarded message ----------
From: Todd Blanchard <[hidden email]>
Date: Sun, 06 Dec 2015 08:37:12 -0800
Subject: Re: [Pharo-dev] [squeak-dev] Unicode Support
To: Pharo Development List <[hidden email]>

(Resent because of a bounce notification (email handling in OS X is
really beginning to annoy me).  Sorry if it's a dup.)

I used to worry a lot about strings being indexable.  And then I
eventually let go of that and realized that it isn't a particularly
important property for them to have.

I think you will find that UTF8 is generally the most convenient for a
lot of things, but it's a bit like light in that you treat it
alternately as a wave or a particle depending on what you are trying
to do.

So it goes with strings - they can be treated alternately as streams
or byte arrays (not character arrays - stop thinking in characters).
In practice, this tends not to be a problem, since a lot of the time
when you want to replace a character or pick out the nth one you are
doing something very computerish, and the characters you are working
with are the single-byte (ASCII legacy) variety.  You generally know
when you can get away with that and when you can't.

Otherwise you are most likely doing things that are best dealt with in
a streaming paradigm.  For most computation, you come to realize you
don't generally care how many characters, but how much space (bytes),
you need to store your chunk of text.  Collation is tricky and
complicated in Unicode in general, but it isn't any worse in UTF8 than
in any other encoding.  You are still going to scan each sortable item
from front to back to determine its order, regardless.
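
For instance, here is what reading one code point at a time off a UTF-8
byte stream looks like - a hand-rolled sketch, with validation of
continuation bytes elided:

nextCodePointFrom: aStream
    "Sketch: decode the next code point from a ReadStream of UTF-8 bytes."
    | b1 |
    b1 := aStream next.
    b1 < 16r80 ifTrue: [^ b1].                       "1 byte: ASCII"
    b1 < 16rE0 ifTrue: [^ ((b1 bitAnd: 16r1F) bitShift: 6)
        + (aStream next bitAnd: 16r3F)].             "2 bytes"
    b1 < 16rF0 ifTrue: [^ ((b1 bitAnd: 16r0F) bitShift: 12)
        + ((aStream next bitAnd: 16r3F) bitShift: 6)
        + (aStream next bitAnd: 16r3F)].             "3 bytes"
    ^ ((b1 bitAnd: 16r07) bitShift: 18)
        + ((aStream next bitAnd: 16r3F) bitShift: 12)
        + ((aStream next bitAnd: 16r3F) bitShift: 6)
        + (aStream next bitAnd: 16r3F)               "4 bytes"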

Most of the outside world has settled on UTF8, and any ASCII file is
already UTF8 - which is why it ends up being so convenient.  Most of
our old text handling infrastructure can still handle UTF8, while it
tends to choke on wider encodings.

-Todd Blanchard

> On Dec 6, 2015, at 07:23, H. Hirzel <[hidden email]> wrote:
>
>> We do the same thing, but that doesn't mean it's a good idea to create a
>> new String-like class having its content encoded in UTF-8, because
>> UTF-8-encoded strings can't be modified like regular strings. While it
>> would be possible to implement all operations, such implementation would
>> become the next SortedCollection (bad performance due to misuse).


Re: [Pharo-dev] Unicode Support

EuanM
In reply to this post by EuanM
Hi Henry,

To be honest, at some point I'm going to long for the much
more succinct semantics of healthcare systems and sports scoring and
administration systems again.  :-)

codepoints are any of:
  - the representation of a component of an abstract character (e.g.
"A" #(0041) as a component of a composed character), *or*
  - the sole representation of the whole of an abstract character, *or*
  - a representation of an abstract character provided for backwards
compatibility, which is more properly represented by a series of
codepoints representing a composed character

e.g.

The "A" #(0041) as a codepoint can be:
the sole representation of the whole of an abstract character "A" #(0041)

The representation of a component of the composed (i.e. preferred)
version of the abstract character Å #(0041 030a)

Å (#00C5) represents one valid compatibility form of the abstract
character Å which is most properly represented by #(0041 030a).

Å (#212b) also represents one valid compatibility form of the abstract
character Å which is most properly represented by #(0041 030a).
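
Laid out as codepoint sequences (16r hex literals; the equivalences are
straight from the Unicode character data):

#(16r0041 16r030A)   "A + COMBINING RING ABOVE - the canonical decomposition"
#(16r00C5)           "LATIN CAPITAL LETTER A WITH RING ABOVE"
#(16r212B)           "ANGSTROM SIGN - canonically equivalent; normalization
                      replaces it with one of the two forms above"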

With any luck, this satisfies both our semantic understandings of the
concept of "codepoint"

Would you agree with that?

In Unicode, codepoints are *NOT* an abstract numerical representation
of a text character.

At least not as we generally understand the term "text character" from
our experience of non-Unicode character mappings.

codepoints represent "*encoded characters*" and "a *text element* ...
is represented by a sequence of one or more codepoints".  (And the
term "text element" is deliberately left undefined in the Unicode
standard)

Individual codepoints are very often *not* the encoded form of an
abstract character that we are interested in - unless we are
communicating to or from another system (which in some cases is the
Smalltalk ByteString class).

In other words:

*Some* individual codepoints *may* be a representation of a specific
*abstract character*, but only in special cases.

The general case in Unicode is that Unicode defines one or more
representations of a Unicode *abstract character*.

The Unicode standard representation of an abstract character is a
composed sequence of codepoints, where in some cases that sequence is
as short as 1 codepoint.

In other cases, Unicode has a compatibility alias of a single
codepoint which is *also* a representation of an abstract character.

There are some cases where an abstract character can be represented by
more than one such single-codepoint compatibility form.

Cheers,
  Euan

On 7 December 2015 at 11:11, Henrik Johansen
<[hidden email]> wrote:

>
>> On 07 Dec 2015, at 11:51 , EuanM <[hidden email]> wrote:
>>
>> And indeed, in principle.
>>
>> On 7 December 2015 at 10:51, EuanM <[hidden email]> wrote:
>>> Verifying assumptions is the key reason why you should send documents like
>>> this out for review.
>>>
>>> Sven -
>>>
>>> I'm confident I understand the use of UTF-8 in principal.
>
> I can only second Sven's sentiment that you need to better differentiate code points (an abstract numerical representation of a character, where a set of such mappings
> define a charset, such as Unicode), and character encoding forms. (which are how code points are represented in bytes by a defined process such as UTF-8, UTF-16 etc).
>
> I know you'll probably think I'm arguing semantics again, but these are *important* semantics ;)
>
> Cheers,
> Henry
