You were right on this point; I see I remembered the terminology incorrectly. http://www.unicode.org/versions/Unicode8.0.0/ch02.pdf figure 2.8 does use "encoded characters" for the mapping of abstract characters to their equivalent codepoint(s)/sequences. What I thought it meant is better described as a codepoint's byte output under an "encoding scheme". An accurate description following that terminology would be that Pharo/Squeak Strings keep data in the UTF-32 encoding form, where 1 codepoint = 1 code unit, dynamically switched between Latin-1 (ByteString) and UTF-32 (WideString) encoding schemes as needed. With the same terminology, the important concepts to distinguish are a code point, a code unit, and how an encoding scheme represents a codepoint as code units/bytes. Quite a mouthful though! Cheers Henry |
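To make the ByteString / WideString switch concrete, here is a rough workspace sketch (assuming a recent Pharo/Squeak image; the printed results are what the representation described above implies, not something re-verified for every version):

    | s |
    'abc' class.                                "=> ByteString: every code point fits in one byte (the Latin-1 scheme)"
    (Character value: 16r212B) asString class.  "=> WideString: the Angstrom sign needs a 32-bit code unit"
    s := 'abc' copyWith: (Character value: 16r212B).
    s class.                                    "=> WideString: the whole string gets widened automagically"
    (s at: 4) value.                            "=> 8491, i.e. 16r212B; at: answers whole code points, 1 code point = 1 code unit"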
In reply to this post by EuanM
On Fri, Dec 4, 2015 at 7:42 PM, EuanM <[hidden email]> wrote:
> I'm currently groping my way to seeing how feature-complete our > Unicode support is. I am doing this to establish what still needs to > be done to provide full Unicode support. > > This seems to me to be an area where it would be best to write it > once, and then have the same codebase incorporated into the Smalltalks > that most share a common ancestry. > > I am keen to get: equality-testing for strings; sortability for > strings which have ligatures and diacritic characters; and correct > round-tripping of data. > > Call to action: > ========== > > If you have comments on these proposals - such as "but we already have > that facility" or "the reason we do not have these facilities is > because they are dog-slow" - please let me know them. > > If you would like to help out, please let me know. > > If you have Unicode experience and expertise, and would like to be, or > would be willing to be, in the 'council of experts' for this project, > please let me know. > > If you have comments or ideas on anything mentioned in this email > > In the first instance, the initiative's website will be: > http://smalltalk.uk.to/unicode.html > > I have created a SqueakSource.com project called UnicodeSupport > > I want to avoid re-inventing any facilities which already exist. > Except where they prevent us reaching the goals of: > - sortable UTF8 strings > - sortable UTF16 strings > - equivalence testing of 2 UTF8 strings > - equivalence testing of 2 UTF16 strings > - round-tripping UTF8 strings through Smalltalk > - roundtripping UTF16 strings through Smalltalk. > As I understand it, we have limited Unicode support atm. > > Current state of play > =============== > ByteString gets converted to WideString when need is automagically detected. > > Is there anything else that currently exists? > > Definition of Terms > ============== > A quick definition of terms before I go any further: > > Standard terms from the Unicode standard > =============================== > a compatibility character : an additional encoding of a *normal* > character, for compatibility and round-trip conversion purposes. For > instance, a 1-byte encoding of a Latin character with a diacritic. > > Made-up terms > ============ > a convenience codepoint : a single codepoint which represents an item > that is also encoded as a string of codepoints. > > (I tend to use the terms compatibility character and compatibility > codepoint interchangably. The standard only refers to them as > compatibility characters. However, the standard is determined to > emphasise that characters are abstract and that codepoints are > concrete. So I think it is often more useful and productive to think > of compatibility or convenience codepoints). > > a composed character : a character made up of several codepoints > > Unicode encoding explained > ===================== > A convenience codepoint can therefore be thought of as a code point > used for a character which also has a composed form. > > The way Unicode works is that sometimes you can encode a character in > one byte, sometimes not. Sometimes you can encode it in two bytes, > sometimes not. > > You can therefore have a long stream of ASCII which is single-byte > Unicode. If there is an occasional Cyrillic or Greek character in the > stream, it would be represented either by a compatibility character or > by a multi-byte combination. > > Using compatibility characters can prevent proper sorting and > equivalence testing. > > Using "pure" Unicode, ie. 
"normal encodings", can cause compatibility > and round-tripping probelms. Although avoiding them can *also* cause > compatibility issues and round-tripping problems. > > Currently my thinking is: > > a Utf8String class > an Ordered collection, with 1 byte characters as the modal element, > but short arrays of wider strings where necessary > a Utf16String class > an Ordered collection, with 2 byte characters as the modal element, > but short arrays of wider strings > beginning with a 2-byte endianness indicator. > > Utf8Strings sometimes need to be sortable, and sometimes need to be compatible. > > So my thinking is that Utf8String will contain convenience codepoints, > for round-tripping. And where there are multiple convenience > codepoints for a character, that it standardises on one. > > And that there is a Utf8SortableString which uses *only* normal characters. > > We then need methods to convert between the two. > > aUtf8String asUtf8SortableString > > and > > aUtf8SortableString asUtf8String > > > Sort orders are culture and context dependent - Sweden and Germany > have different sort orders for the same diacritic-ed characters. Some > countries have one order in general usage, and another for specific > usages, such as phone directories (e.g. UK and France) > > Similarly for Utf16 : Utf16String and Utf16SortableString and > conversion methods > > A list of sorted words would be a SortedCollection, and there could be > pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder, > seOrder, ukOrder, etc > > along the lines of > aListOfWords := SortedCollection sortBlock: deOrder > > If a word is either a Utf8SortableString, or a well-formed Utf8String, > then we can perform equivalence testing on them trivially. > > To make sure a Utf8String is well formed, we would need to have a way > of cleaning up any convenience codepoints which were valid, but which > were for a character which has multiple equally-valid alternative > convenience codepoints, and for which the string currently had the > "wrong" convenience codepoint. (i.e for any character with valid > alternative convenience codepoints, we would choose one to be in the > well-formed Utf8String, and we would need a method for cleaning the > alternative convenience codepoints out of the string, and replacing > them with the chosen approved convenience codepoint. > > aUtf8String cleanUtf8String > > With WideString, a lot of the issues disappear - except > round-tripping(although I'm sure I have seen something recently about > 4-byte strings that also have an additional bit. Which would make > some Unicode characters 5-bytes long.) > > > (I'm starting to zone out now - if I've overlooked anything - obvious, > subtle, or somewhere in between, please let me know) > > Cheers, > Euan > Good initiative. Here is some info I've bookmarked over time... http://www.joelonsoftware.com/articles/Unicode.html >> The Single Most Important Fact About Encodings - It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII. There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly. http://kunststube.net/encoding/ >> So what does it mean for a language to natively support or not support Unicode? It basically refers to whether the language assumes that one character equals one byte or not. 
>> What does it mean for a language to support Unicode then? Javascript for example supports Unicode. In fact, any string in Javascript is UTF-16 encoded. In fact, it's the only thing Javascript deals with. You cannot have a string in Javascript that is not UTF-16 encoded. Javascript worships Unicode to the extent that there's no facility to deal with any other encoding in the core language. >> Other languages are simply encoding-aware. Internally they store strings in a particular encoding, http://cafe.elharo.com/programming/the-ten-commandments-of-unicode/ >> 1. I am Unicode, thy character set. Thou shalt have no other character sets before me. >> 4. Thou shalt not refer to Unicode as a two-byte character set. >> 6. Thou shalt count and index Unicode characters, not UTF-16 code points. >> 7. Thou shalt use UTF-8 as the preferred encoding wherever possible. https://xkcd.com/1137/ >> ruomuh https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html >> Strings and Characters in Swift http://oleb.net/blog/2014/07/swift-strings/ Interesting article, though I don't yet grok it all, I thought it worth sharing... >> Swiftʼs string implemenation makes working with Unicode easier and significantly less error-prone. >> Strings in Swift are represented by the String type. A String is a collection of Charactervalues. A Swift Character represents one perceived character (what a person thinks of as a single character, called a grapheme). Since Unicode often uses two or more code points(called a grapheme cluster) to form one perceived character, this implies that a Charactercan be composed of multiple Unicode scalar values if they form a single grapheme cluster. (Unicode scalar is the term for any Unicode code point except surrogate pair characters, which are used to encode UTF-16.) >> This change has the potential to prevent many common errors when dealing with string lengths or substrings. It is a huge difference to most other Unicode-aware string libraries (including NSString) where the building blocks of a string are usually UTF-16 code units or single Unicode scalars. cheers -ben |
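A rough workspace sketch of the culture-dependent sort order point quoted above (runs in a current Pharo/Squeak image; the deOrder block is only a crude stand-in that folds Ä onto A before comparing - a real implementation needs proper Unicode collation tables):

    | deOrder |
    #('Apfel' 'Zebra' 'Äpfel') asSortedCollection asArray.
        "=> #('Apfel' 'Zebra' 'Äpfel') - the default order sorts Ä (16rC4) after Z, by raw character value"
    deOrder := [:a :b |
        (a collect: [:c | c = $Ä ifTrue: [$A] ifFalse: [c]])
            <= (b collect: [:c | c = $Ä ifTrue: [$A] ifFalse: [c]])].
    (#('Apfel' 'Zebra' 'Äpfel') asSortedCollection: deOrder) asArray.
        "Äpfel now sorts next to Apfel, before Zebra - closer to what a German reader expects"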
Sorry to double post, I missed the cross-post.
In reply to this post by Henrik Sperre Johansen
On Mon, Dec 7, 2015 at 10:48 PM, Henrik Johansen
<[hidden email]> wrote: > > On 07 Dec 2015, at 2:06 , Henrik Johansen <[hidden email]> > wrote: > > > codepoints represent "*encoded characters*" > > > No. a codepoint is the numerical value assigned to a character. An "encoded > character" is the way a codepoint is represented in bytes using a given > encoding. > > > You were right on this point, I see I remembered the terminology of this > incorrectly. > http://www.unicode.org/versions/Unicode8.0.0/ch02.pdf figure 2.8 does use > "encoded characters" for the mapping of abstract characters to its > equivalent codepoint (s/ sequences). What I thought it meant is better > described as a codepoint's byte output using an "encoding scheme". > > An accurate description following that terminology, would be that > Pharo/Squeak Strings keep data in UTF32 encoding form, where 1 codepoint = 1 > code unit, dynamically switched between Latin1 (ByteStrings) and UTF32 > (WideStrings) encoding schemes as needed. The implication from Joel's unicode article (linked from my other post) is that whatever encoding we use to store strings, the encoding should not be implicit (i.e. by convention defined outside the image). *Every* string needs to record its encoding. Maybe we should follow Swift [1] and have Characters comprised of multiple codepoints, and/or a String be able to handle a sequence of differently encoded Characters, i.e. String being a mixed sequence of UTF-8, UTF-16, UTF-32 Characters. I have no idea what that wold do for efficiency, but maybe let Moore's Law handle that. [1] http://oleb.net/blog/2014/07/swift-strings/ >> A Swift Character represents one perceived character (what a person thinks of as a single character, called a grapheme). Since Unicode often uses two or more code points(called a grapheme cluster) to form one perceived character, this implies that a Charactercan be composed of multiple Unicode scalar values if they form a single grapheme cluster. cheers -ben > With the same terminology, the difference between a code point, a code unit, > how an encoding scheme represents a codepoint as code units/bytes, are the > concepts it is important to distinguish. > Quite a mouthful though! |
In reply to this post by Henrik Sperre Johansen
"To encode Unicode for external representation as bytes, we use UTF-8
like the rest of the modern world. So far, so good. Why all the confusion ?" The confusion arises because simply providing *a* valid UTF-8 encoding does not ensure sortability, nor testability for equivalence. It might provide sortable strings. It might not. It might provide a string that can be compared to another string successfully. It might not. So being able to perform valid UTF-8 encoding is *necessary*, but *not sufficient*. i.e. the confusion arises because UTF-8 allows several competing, non-sortable encodings of even a single character. This means that *valid* UTF-8 cannot be relied upon to provide these facilities *unless* all the UTF-8 strings can be relied upon to have been encoded by the same specified process, i.e. *unless* each string has gone through *a specific* valid method of encoding to UTF-8. Understanding the concept of abstract character is, imo, key to understanding the differences between the various valid UTF-8 forms of a given abstract character. Cheers, Euan On 9 December 2015 at 10:45, Sven Van Caekenberghe <[hidden email]> wrote: > >> On 09 Dec 2015, at 10:35, Guillermo Polito <[hidden email]> wrote: >> >> >>> On 8 dic 2015, at 10:07 p.m., EuanM <[hidden email]> wrote: >>> >>> "No. a codepoint is the numerical value assigned to a character. An >>> "encoded character" is the way a codepoint is represented in bytes >>> using a given encoding." >>> >>> No. >>> >>> A codepoint may represent a component part of an abstract character, >>> or may represent an abstract character, or it may do both (but not >>> always at the same time). >>> >>> Codepoints represent a single encoding of a single concept. >>> >>> Sometimes that concept represents a whole abstract character. >>> Sometimes it represent part of an abstract character. >> >> Well. I do not agree with this. I agree with the quote. >> >> Can you explain a bit more about what you mean by abstract character and concept? > > I am pretty sure that this whole discussion does more harm than good for most people's understanding of Unicode. > > It is best and (mostly) correct to think of a Unicode string as a sequence of Unicode characters, each defined/identified by a code point (out of 10.000s covering all languages). That is what we have today in Pharo (with the distinction between ByteString and WideString as mostly invisible implementation details). > > To encode Unicode for external representation as bytes, we use UTF-8 like the rest of the modern world. > > So far, so good. > > Why all the confusion ? Because the world is a complex place and the Unicode standard tries to cover all possible things. Citing all these exceptions and special cases will make people crazy and give up. I am sure that most stopped reading this thread. > > Why then is there confusion about the seemingly simple concept of a character ? Because Unicode allows different ways to say the same thing. The simplest example in a common language is (the French letter é) is > > LATIN SMALL LETTER E WITH ACUTE [U+00E9] > > which can also be written as > > LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301] > > The former being a composed normal form, the latter a decomposed normal form. (And yes, it is even much more complicated than that, it goes on for 1000s of pages). > > In the above example, the concept of character/string is indeed fuzzy. > > HTH, > > Sven > > |
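Sven's é example, spelled out in a workspace, shows exactly where the two valid UTF-8 byte outputs come from (this assumes the Zinc utf8Encoded extension present in recent Pharo images; the byte values are the standard UTF-8 encodings of those code points):

    | composed decomposed |
    composed   := (Character value: 16rE9) asString.                    "é as the single code point U+00E9"
    decomposed := WideString with: $e with: (Character value: 16r0301). "e followed by COMBINING ACUTE ACCENT"
    composed utf8Encoded.     "=> #[195 169] - two bytes, valid UTF-8"
    decomposed utf8Encoded.   "=> #[101 204 129] - three bytes, equally valid UTF-8"
    composed = decomposed.    "=> false - naive comparison or sorting treats the same abstract character as two different strings unless both sides were normalised by the same process"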
In reply to this post by Henrik Sperre Johansen
"Well. I do not agree with this. I agree with the quote.
Can you explain a bit more about what you mean by abstract character and concept?" --Guillermo The problem with the quote is, that *while true*, it *does not disambiguate* between: either compatibility character and abstract character; or character as composable component of an abstract character and character as the entire embodiment of an abstract character; Abstract character is the key concept of Unicode. Differentiation between abstract character and codepoints is the key differentiator of the Unicode approach and most previous approaches to character encoding, e,g, ASCII, EBCDIC, ISO Latin 1, etc Please see my previous posts which use the example of Angstrom, Capital A with circle (or whatever the canonical name is) and the composed sequence of "Capital A" and "circle above a letter" for a fuller explanation of the concept of "abstract character". On 9 December 2015 at 09:35, Guillermo Polito <[hidden email]> wrote: > >> On 8 dic 2015, at 10:07 p.m., EuanM <[hidden email]> wrote: >> >> "No. a codepoint is the numerical value assigned to a character. An >> "encoded character" is the way a codepoint is represented in bytes >> using a given encoding." >> >> No. >> >> A codepoint may represent a component part of an abstract character, >> or may represent an abstract character, or it may do both (but not >> always at the same time). >> >> Codepoints represent a single encoding of a single concept. >> >> Sometimes that concept represents a whole abstract character. >> Sometimes it represent part of an abstract character. > > Well. I do not agree with this. I agree with the quote. > > Can you explain a bit more about what you mean by abstract character and concept? > >> >> This is the key difference between Unicode and most character encodings. >> >> A codepoint does not always represent a whole character. >> >> On 7 December 2015 at 13:06, Henrik Johansen >> <[hidden email]> wrote: >>> >>> On 07 Dec 2015, at 1:05 , EuanM <[hidden email]> wrote: >>> >>> Hi Henry, >>> >>> To be honest, at some point I'm going to long for the for the much >>> more succinct semantics of healthcare systems and sports scoring and >>> administration systems again. :-) >>> >>> codepoints are any of *either* >>> - the representation of a component of an abstract character, *or* >>> eg. "A" #(0041) as a component of >>> - the sole representation of the whole of an abstract character *or* of >>> - a representation of an abstract character provided for backwards >>> compatibility which is more properly represented by a series of >>> codepoints representing a composed character >>> >>> e.g. >>> >>> The "A" #(0041) as a codepoint can be: >>> the sole representation of the whole of an abstract character "A" #(0041) >>> >>> The representation of a component of the composed (i.e. preferred) >>> version of the abstract character Å #(0041 030a) >>> >>> Å (#00C5) represents one valid compatibility form of the abstract >>> character Å which is most properly represented by #(0041 030a). >>> >>> Å (#212b) also represents one valid compatibility form of the abstract >>> character Å which is most properly represented by #(0041 030a). >>> >>> With any luck, this satisfies both our semantic understandings of the >>> concept of "codepoint" >>> >>> Would you agree with that? >>> >>> In Unicode, codepoints are *NOT* an abstract numerical representation >>> of a text character. >>> >>> At least not as we generally understand the term "text character" from >>> our experience of non-Unicode character mappings. 
>>> >>> >>> I agree, they are numerical representations of what Unicode refers to as >>> characters. >>> >>> >>> codepoints represent "*encoded characters*" >>> >>> >>> No. a codepoint is the numerical value assigned to a character. An "encoded >>> character" is the way a codepoint is represented in bytes using a given >>> encoding. >>> >>> and "a *text element* ... >>> is represented by a sequence of one or more codepoints". (And the >>> term "text element" is deliberately left undefined in the Unicode >>> standard) >>> >>> Individual codepoints are very often *not* the encoded form of an >>> abstract character that we are interested in. Unless we are >>> communicating to or from another system (Which in some cases is the >>> Smalltalk ByteString class) >>> >>> >>> >>> >>> i.e. in other words >>> >>> *Some* individual codepoints *may* be a representation of a specific >>> *abstract character*, but only in special cases. >>> >>> The general case in Unicode is that Unicode defines (a) >>> representation(s) of a Unicode *abstract character*. >>> >>> The Unicode standard representation of an abstract character is a >>> composed sequence of codepoints, where in some cases that sequence is >>> as short as 1 codepoint. >>> >>> In other cases, Unicode has a compatibility alias of a single >>> codepoint which is *also* a representation of an abstract character >>> >>> There are some cases where an abstract character can be represented by >>> more than one single-codepoint compatibility codepoint. >>> >>> Cheers, >>> Euan >>> >>> >>> I agree you have a good grasp of the distinction between an abstract >>> character (characters and character sequences which should be treated >>> equivalent wrt, equality / sorting / display, etc.) and a character (which >>> each have a code point assigned). >>> That is besides the point both Sven and I tried to get through, which is the >>> difference between a code point and the encoded form(s) of said code point. >>> When you write: >>> "and therefore encodable in UTF-8 as compatibility codepoint e9 hex >>> and as the composed character #(0065 00b4) (all in hex) and as the >>> same composed character as both >>> #(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are >>> included" >>> >>> I's quite clear you confuse the two. 0xFEFF is the codepoint of the >>> character used as bom. >>> When you state that it can be written ffef (I assume you meant FFFE), you >>> are again confusing the code point and its encoded value (an encoded value >>> which only occurs in UTF16/32, no less). >>> >>> When this distinction is clear, it might be easier to see that value in that >>> Strings are kept as Unicode code points arrays, and converted to encoded >>> forms when entering/exiting the system. >>> >>> Cheers, >>> Henry >>> >> > > |
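The Å example above, in workspace terms (plain Pharo/Squeak; the three code point sequences are the ones just discussed):

    | precomposed angstromSign decomposed |
    precomposed  := (Character value: 16rC5) asString.                    "Å - LATIN CAPITAL LETTER A WITH RING ABOVE"
    angstromSign := (Character value: 16r212B) asString.                  "Å - ANGSTROM SIGN, the compatibility codepoint"
    decomposed   := WideString with: $A with: (Character value: 16r030A). "A + COMBINING RING ABOVE, #(0041 030A)"
    precomposed = angstromSign.   "=> false"
    precomposed = decomposed.     "=> false"
    "All three spell the same abstract character, yet no two are equal as codepoint sequences.
     Unicode normalisation (NFC folds all three to 16rC5, NFD to #(0041 030A)) is what the
     'cleaning up' of convenience codepoints described earlier amounts to."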
In reply to this post by Henrik Sperre Johansen
I agree with all of that, Ben.
I'm currently fairly certain that fully-composed abstract characters is a term that is 1:1 mapped with the term "grapheme cluster" (i.e. one is an older Unicode description of a newer Unicode term). And once we create these, I think this sort of implementation is straightforward. For particular values of "straightforward", of course :-) i.e. the Swift approach is equivalent to the approach I originally proposed and asked for critiques of. One thing I don't understand.... why does the fact the composed abstract character (aka grapheme cluster) is a sequence mean that an array cannot be used to hold the sequence? If people then also want a compatibility-codepoints-only UTF-8 representation, it is simple to provide comparable (i.e equivalence-testable) versions of any UTF-8 string - because we are creating them from composed forms by a *single* defined method. For my part, the reason I think we ought to implement it *in* Smalltalk is ... this is the String class of the new age. I want Smalltalk to be handle Strings as native objects. On 10 December 2015 at 23:41, Ben Coman <[hidden email]> wrote: > On Wed, Dec 9, 2015 at 5:35 PM, Guillermo Polito > <[hidden email]> wrote: >> >>> On 8 dic 2015, at 10:07 p.m., EuanM <[hidden email]> wrote: >>> >>> "No. a codepoint is the numerical value assigned to a character. An >>> "encoded character" is the way a codepoint is represented in bytes >>> using a given encoding." >>> >>> No. >>> >>> A codepoint may represent a component part of an abstract character, >>> or may represent an abstract character, or it may do both (but not >>> always at the same time). >>> >>> Codepoints represent a single encoding of a single concept. >>> >>> Sometimes that concept represents a whole abstract character. >>> Sometimes it represent part of an abstract character. >> >> Well. I do not agree with this. I agree with the quote. >> >> Can you explain a bit more about what you mean by abstract character and concept? > > This seems to be what Swift is doing, where Strings are not composed > not of codepoints but of graphemes. > >>>> "Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence** of one or more Unicode scalars that (when combined) produce a single human-readable character. [1] > > ** i.e. not an array > >>>> Here’s an example. The letter é can be represented as the single Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same letter can also be represented as a pair of scalars—a standard letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT scalar (U+0301). TheCOMBINING ACUTE ACCENT scalar is graphically applied to the scalar that precedes it, turning an e into an éwhen it is rendered by a Unicode-aware text-rendering system. [1] > >>>> In both cases, the letter é is represented as a single Swift Character value that represents an extended grapheme cluster. In the first case, the cluster contains a single scalar; in the second case, it is a cluster of two scalars:" [1] > >>>> Swiftʼs string implemenation makes working with Unicode easier and significantly less error-prone. As a programmer, you still have to be aware of possible edge cases, but this probably cannot be avoided completely considering the characteristics of Unicode. [2] > > Indeed I've tried searched for what problems it causes and get a null > result. So I read *all*good* things about Swift's unicode > implementation reducing common errors dealing with Unicode. 
Can > anyone point to complaints about Swift's unicode implementation? > Maybe this... > >>>> An argument could be made that the implementation of String as a sequence that requires iterating over characters from the beginning of the string for many operations poses a significant performance problem but I do not think so. My guess is that Appleʼs engineers have considered the implications of their implementation and apps that do not deal with enormous amounts of text will be fine. Moreover, the idea that you could get away with an implementation that supports random access of characters is an illusion given the complexity of Unicode. [2] > > Considering our common pattern: Make it work, Make it right, Make it > fast -- maybe Strings as arrays are a premature optimisation, that > was the right choice in the past prior to Unicode, but considering > Moore's Law versus programmer time, is not the best choice now. > Should we at least start with a UnicodeString and UnicodeCharacter > that operates like Swift, and over time *maybe* move the tools to use > them. > > [1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html > [2] http://oleb.net/blog/2014/07/swift-strings/ > > cheers -ben > >> >>> >>> This is the key difference between Unicode and most character encodings. >>> >>> A codepoint does not always represent a whole character. >>> >>> On 7 December 2015 at 13:06, Henrik Johansen > |
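The length/iteration side of the same issue, in workspace terms (plain Pharo/Squeak; a Swift-style grapheme view would answer 1 for both spellings):

    | precomposed decomposed |
    precomposed := (Character value: 16rE9) asString.                    "é as U+00E9"
    decomposed  := WideString with: $e with: (Character value: 16r0301). "e + COMBINING ACUTE ACCENT"
    precomposed size.    "=> 1"
    decomposed size.     "=> 2 - size counts codepoints, not perceived characters"
    decomposed reversed. "the combining accent now precedes its base letter - codepoint-wise operations ignore grapheme boundaries"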
On Fri, Dec 11, 2015 at 10:43 AM, EuanM <[hidden email]> wrote:
> I agree with all of that, Ben. > > I'm currently fairly certain that fully-composed abstract characters > is a term that is 1:1 mapped with the term "grapheme cluster" (i.e. > one is an older Unicode description of a newer Unicode term). > > And once we create these, I think this sort of implementation is > straightforward. For particular values of "straightforward", of > course :-) > > i.e. the Swift approach is equivalent to the approach I originally > proposed and asked for critiques of. > > One thing I don't understand.... why does the fact the composed > abstract character (aka grapheme cluster) is a sequence mean that an > array cannot be used to hold the sequence? I realised the same question after I had posted. I don't know the answer. Maybe just that particular implementation. cheers -ben > If people then also want a compatibility-codepoints-only UTF-8 > representation, it is simple to provide comparable (i.e > equivalence-testable) versions of any UTF-8 string - because we are > creating them from composed forms by a *single* defined method. > > For my part, the reason I think we ought to implement it *in* > Smalltalk is ... this is the String class of the new age. I want > Smalltalk to be handle Strings as native objects. > > > On 10 December 2015 at 23:41, Ben Coman <[hidden email]> wrote: >> On Wed, Dec 9, 2015 at 5:35 PM, Guillermo Polito >> <[hidden email]> wrote: >>> >>>> On 8 dic 2015, at 10:07 p.m., EuanM <[hidden email]> wrote: >>>> >>>> "No. a codepoint is the numerical value assigned to a character. An >>>> "encoded character" is the way a codepoint is represented in bytes >>>> using a given encoding." >>>> >>>> No. >>>> >>>> A codepoint may represent a component part of an abstract character, >>>> or may represent an abstract character, or it may do both (but not >>>> always at the same time). >>>> >>>> Codepoints represent a single encoding of a single concept. >>>> >>>> Sometimes that concept represents a whole abstract character. >>>> Sometimes it represent part of an abstract character. >>> >>> Well. I do not agree with this. I agree with the quote. >>> >>> Can you explain a bit more about what you mean by abstract character and concept? >> >> This seems to be what Swift is doing, where Strings are not composed >> not of codepoints but of graphemes. >> >>>>> "Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence** of one or more Unicode scalars that (when combined) produce a single human-readable character. [1] >> >> ** i.e. not an array >> >>>>> Here’s an example. The letter é can be represented as the single Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same letter can also be represented as a pair of scalars—a standard letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT scalar (U+0301). TheCOMBINING ACUTE ACCENT scalar is graphically applied to the scalar that precedes it, turning an e into an éwhen it is rendered by a Unicode-aware text-rendering system. [1] >> >>>>> In both cases, the letter é is represented as a single Swift Character value that represents an extended grapheme cluster. In the first case, the cluster contains a single scalar; in the second case, it is a cluster of two scalars:" [1] >> >>>>> Swiftʼs string implemenation makes working with Unicode easier and significantly less error-prone. 
As a programmer, you still have to be aware of possible edge cases, but this probably cannot be avoided completely considering the characteristics of Unicode. [2] >> >> Indeed I've tried searched for what problems it causes and get a null >> result. So I read *all*good* things about Swift's unicode >> implementation reducing common errors dealing with Unicode. Can >> anyone point to complaints about Swift's unicode implementation? >> Maybe this... >> >>>>> An argument could be made that the implementation of String as a sequence that requires iterating over characters from the beginning of the string for many operations poses a significant performance problem but I do not think so. My guess is that Appleʼs engineers have considered the implications of their implementation and apps that do not deal with enormous amounts of text will be fine. Moreover, the idea that you could get away with an implementation that supports random access of characters is an illusion given the complexity of Unicode. [2] >> >> Considering our common pattern: Make it work, Make it right, Make it >> fast -- maybe Strings as arrays are a premature optimisation, that >> was the right choice in the past prior to Unicode, but considering >> Moore's Law versus programmer time, is not the best choice now. >> Should we at least start with a UnicodeString and UnicodeCharacter >> that operates like Swift, and over time *maybe* move the tools to use >> them. >> >> [1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html >> [2] http://oleb.net/blog/2014/07/swift-strings/ >> >> cheers -ben >> >>> >>>> >>>> This is the key difference between Unicode and most character encodings. >>>> >>>> A codepoint does not always represent a whole character. >>>> >>>> On 7 December 2015 at 13:06, Henrik Johansen >> > |
In reply to this post by EuanM
Hi Euan,
> On Dec 10, 2015, at 6:43 PM, EuanM <[hidden email]> wrote: > > I agree with all of that, Ben. > > I'm currently fairly certain that fully-composed abstract characters > is a term that is 1:1 mapped with the term "grapheme cluster" (i.e. > one is an older Unicode description of a newer Unicode term). > > And once we create these, I think this sort of implementation is > straightforward. For particular values of "straightforward", of > course :-) > > i.e. the Swift approach is equivalent to the approach I originally > proposed and asked for critiques of. > > One thing I don't understand.... why does the fact the composed > abstract character (aka grapheme cluster) is a sequence mean that an > array cannot be used to hold the sequence? Of course an Array can be used, but one good reason to use bits organized as four-byte units is that the garbage collector spends no time scanning them, whereas as far as it's concerned the Array representation is all objects and must be scanned. Another reason is that foreign code may find the bits representation compatible and so it can be passed through the FFI to other languages, whereas the Array of tagged characters will always require conversion. Yet another reason is that in 64 bits the Array takes twice the space of the bits object. > If people then also want a compatibility-codepoints-only UTF-8 > representation, it is simple to provide comparable (i.e > equivalence-testable) versions of any UTF-8 string - because we are > creating them from composed forms by a *single* defined method. > > For my part, the reason I think we ought to implement it *in* > Smalltalk is ... this is the String class of the new age. I want > Smalltalk to be handle Strings as native objects. There's little if any difference in convenience of use between an Array of characters and a bits array with the string at:/at:put: primitives, since both require at:/at:put: to access, but the latter is (efficiently) type checked (by the VM), whereas there's nothing to prevent storing anything other than Characters in the Array unless one introduces the overhead of slower explicit type checks in Smalltalk, and the Array starts life as a sequence of nils (invalid until every element is set to a character) whereas the bits representation begins fully initialized with 0 asCharacter. So there's nothing more "natively objecty" about the Array. Smalltalk objects hide their representation from clients and externally they behave the same, except for space and time. Given that this is a dynamically-typed language there's nothing to prevent one providing both implementations beyond maintenance cost and complexity/confusion. So at least it's easy to do performance comparisons between the two. But I still think the bits representation is superior if what you want is a sequence of Characters. >> On 10 December 2015 at 23:41, Ben Coman <[hidden email]> wrote: >> On Wed, Dec 9, 2015 at 5:35 PM, Guillermo Polito >> <[hidden email]> wrote: >>> >>>> On 8 dic 2015, at 10:07 p.m., EuanM <[hidden email]> wrote: >>>> >>>> "No. a codepoint is the numerical value assigned to a character. An >>>> "encoded character" is the way a codepoint is represented in bytes >>>> using a given encoding." >>>> >>>> No. >>>> >>>> A codepoint may represent a component part of an abstract character, >>>> or may represent an abstract character, or it may do both (but not >>>> always at the same time). >>>> >>>> Codepoints represent a single encoding of a single concept.
>>>> >>>> Sometimes that concept represents a whole abstract character. >>>> Sometimes it represent part of an abstract character. >>> >>> Well. I do not agree with this. I agree with the quote. >>> >>> Can you explain a bit more about what you mean by abstract character and concept? >> >> This seems to be what Swift is doing, where Strings are not composed >> not of codepoints but of graphemes. >> >>>>> "Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence** of one or more Unicode scalars that (when combined) produce a single human-readable character. [1] >> >> ** i.e. not an array >> >>>>> Here’s an example. The letter é can be represented as the single Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same letter can also be represented as a pair of scalars—a standard letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT scalar (U+0301). TheCOMBINING ACUTE ACCENT scalar is graphically applied to the scalar that precedes it, turning an e into an éwhen it is rendered by a Unicode-aware text-rendering system. [1] >> >>>>> In both cases, the letter é is represented as a single Swift Character value that represents an extended grapheme cluster. In the first case, the cluster contains a single scalar; in the second case, it is a cluster of two scalars:" [1] >> >>>>> Swiftʼs string implemenation makes working with Unicode easier and significantly less error-prone. As a programmer, you still have to be aware of possible edge cases, but this probably cannot be avoided completely considering the characteristics of Unicode. [2] >> >> Indeed I've tried searched for what problems it causes and get a null >> result. So I read *all*good* things about Swift's unicode >> implementation reducing common errors dealing with Unicode. Can >> anyone point to complaints about Swift's unicode implementation? >> Maybe this... >> >>>>> An argument could be made that the implementation of String as a sequence that requires iterating over characters from the beginning of the string for many operations poses a significant performance problem but I do not think so. My guess is that Appleʼs engineers have considered the implications of their implementation and apps that do not deal with enormous amounts of text will be fine. Moreover, the idea that you could get away with an implementation that supports random access of characters is an illusion given the complexity of Unicode. [2] >> >> Considering our common pattern: Make it work, Make it right, Make it >> fast -- maybe Strings as arrays are a premature optimisation, that >> was the right choice in the past prior to Unicode, but considering >> Moore's Law versus programmer time, is not the best choice now. >> Should we at least start with a UnicodeString and UnicodeCharacter >> that operates like Swift, and over time *maybe* move the tools to use >> them. >> >> [1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html >> [2] http://oleb.net/blog/2014/07/swift-strings/ >> >> cheers -ben >> >>> >>>> >>>> This is the key difference between Unicode and most character encodings. >>>> >>>> A codepoint does not always represent a whole character. >>>> >>>> On 7 December 2015 at 13:06, Henrik Johansen _,,,^..^,,,_ (phone) |
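For what it's worth, the two representations look like this side by side (plain Pharo/Squeak; the at:put: comments restate the type check described above rather than anything new):

    | bits arr |
    bits := WideString with: $a with: (Character value: 16r1D5).  "one flat bits object, 32 bits per element"
    arr  := bits asArray.                                         "an Array of Character objects (pointers)"
    bits at: 2.   "=> $Ǖ"
    arr at: 2.    "=> $Ǖ - identical access protocol"
    "bits at: 1 put: 42   fails: only Characters can be stored in a String"
    "arr  at: 1 put: 42   succeeds: nothing keeps non-Characters out of the Array"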
On Fri, Dec 11, 2015 at 1:29 AM, Eliot Miranda <[hidden email]> wrote: > For my part, the reason I think we ought to implement it *in* I think Euan was referring to the Gemstone strategy of storing the string content as bits, then calling into ICU (a C++ library for Unicode processing) to manipulate them. So he's saying sure, store the string data as bits, but write Smalltalk code to sort them, render on screen etc. Colin |
On Fri, Dec 11, 2015 at 11:05 AM, Colin Putney <[hidden email]> wrote:
Ah, thanks for the clarification. Unwise of me to read complex emails on the phone :-). Letter box problem. I agree. ICU should be steered clear of at all costs.
_,,,^..^,,,_ best, Eliot |
In reply to this post by EuanM
"If it hasn't already been said, please do not conflate Unicode and
UTF-8. I think that would be a recipe for a high P.I.T.A. factor." --Richard Sargent I agree. :-) Regarding UTF-16, I just want to be able to export to, and receive from, Windows (and any other platforms using UTF-16 as their native character representation). Windows will always be able to accept UTF-16. All Windows apps *might well* export UTF-16. There may be other platforms which use UTF-16 as their native format. I'd just like to be able to cope with those situations. Nothing more. All this requires is a Utf16String class that has an asUtf8String method (and any other required conversion methods), and the other string classes having asUtf16String methods. Once we have those classes and methods, this should be a trivial extension. Export will just be transformations of existing formats of valid strings. Import just needs to transform to (one of) our preferred format(s), and have a validity check performed after the transform is complete. On 11 December 2015 at 15:37, Richard Sargent <[hidden email]> wrote: > EuanM wrote >> ... >> all ISO-8859-1 maps 1:1 to Unicode UTF-8 >> ... > > I am late coming in to this conversation. If it hasn't already been said, > please do not conflate Unicode and UTF-8. I think that would be a recipe for > a high P.I.T.A. factor. > > Unicode defines the meaning of the code points. > UTF-8 (and -16) define an interchange mechanism. > > In other words, when you write the code points to an external medium > (socket, file, whatever), encode them via UTF-whatever. Read UTF-whatever > from an external medium and re-instantiate the code points. > (Personally, I see no use for UTF-16 as an interchange mechanism. Others may > have justification for it. I don't.) > > Having characters be a consistent size in their object representation makes > everything easier. #at:, #indexOf:, #includes: ... no one wants to be > scanning through bytes representing variable sized characters. > > Model Unicode strings using classes such as e.g. Unicode7, Unicode16, and > Unicode32, with automatic coercion to the larger character width. > > > > > -- > View this message in context: http://forum.world.st/Unicode-Support-tp4865139p4866610.html > Sent from the Pharo Smalltalk Developers mailing list archive at Nabble.com. > |
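A sketch of that import/export route as it might look with the Zinc encoders already shipped in Pharo (assuming ZnUTF16Encoder's default byte order; whether a BOM is wanted depends on what the receiving Windows API expects):

    | bytes |
    bytes := ZnUTF16Encoder new encodeString: 'Håkon'.  "String -> UTF-16 bytes for export"
    (ZnUTF16Encoder new decodeBytes: bytes) = 'Håkon'.  "=> true - round-trips back into the image"
    'Håkon' utf8Encoded.                                 "and UTF-8 for everything else"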
In reply to this post by Eliot Miranda-2
Eliot - thank you for explaining to me why my original idea was bad. :-)
I always assumed it would be. Otherwise I'd've just built it the way I proposed. I'm thoroughly delighted to have more knowledgeable people contributing. On 11 December 2015 at 09:29, Eliot Miranda <[hidden email]> wrote: > Hi Euan, > >> On Dec 10, 2015, at 6:43 PM, EuanM <[hidden email]> wrote: >> >> I agree with all of that, Ben. >> >> I'm currently fairly certain that fully-composed abstract characters >> is a term that is 1:1 mapped with the term "grapheme cluster" (i.e. >> one is an older Unicode description of a newer Unicode term). >> >> And once we create these, I think this sort of implementation is >> straightforward. For particular values of "straightforward", of >> course :-) >> >> i.e. the Swift approach is equivalent to the approach I originally >> proposed and asked for critiques of. >> >> One thing I don't understand.... why does the fact the composed >> abstract character (aka grapheme cluster) is a sequence mean that an >> array cannot be used to hold the sequence? > > Of course an Array can be used, but one good reason to use bits organized as four-byte units is that the garbage collector spends no time scanning them, whereas as far as its concerned the Array representation is all objects and must be scanned. Another reason is that foreign code may find the bits representation compatible and so they can be passed through the FFI to other languages whereas the Array of tagged characters will always require conversion. Yet another reason is that in 64-bits the Array takes twice the space of the bits object. > >> If people then also want a compatibility-codepoints-only UTF-8 >> representation, it is simple to provide comparable (i.e >> equivalence-testable) versions of any UTF-8 string - because we are >> creating them from composed forms by a *single* defined method. >> >> For my part, the reason I think we ought to implement it *in* >> Smalltalk is ... this is the String class of the new age. I want >> Smalltalk to be handle Strings as native objects. > > There's little if any difference in convenience of use between an Array of characters and a bits array with the string at:/at:put: primitives since both require at:/at:put: to access, but the latter is (efficiently) type checked (by the VM), whereas there's nothing to prevent storing other than characters in the Areay unless one introduces the overhead of skier explicit type checks in Smalltalk, and the Areay starts life as a sequence of nils (invalid until every element is set to a character) whereas the bits representation begins fully initialized with 0 asCharacter. So there's nothing more "natively objecty" about the Array. Smalltalk objects hide their representation from clients and externally they behave the same, except for space and time. > > Given that this is a dynamically-typed language there's nothing to prevent one providing both implementations beyond maintenance cost and complexity/confusion. So at least it's easy to do performance comparisons between the two. But I still think the bits representation is superior if what you want is a sequence of Characters. > >>> On 10 December 2015 at 23:41, Ben Coman <[hidden email]> wrote: >>> On Wed, Dec 9, 2015 at 5:35 PM, Guillermo Polito >>> <[hidden email]> wrote: >>>> >>>>> On 8 dic 2015, at 10:07 p.m., EuanM <[hidden email]> wrote: >>>>> >>>>> "No. a codepoint is the numerical value assigned to a character. An >>>>> "encoded character" is the way a codepoint is represented in bytes >>>>> using a given encoding." >>>>> >>>>> No. 
>>>>> >>>>> A codepoint may represent a component part of an abstract character, >>>>> or may represent an abstract character, or it may do both (but not >>>>> always at the same time). >>>>> >>>>> Codepoints represent a single encoding of a single concept. >>>>> >>>>> Sometimes that concept represents a whole abstract character. >>>>> Sometimes it represent part of an abstract character. >>>> >>>> Well. I do not agree with this. I agree with the quote. >>>> >>>> Can you explain a bit more about what you mean by abstract character and concept? >>> >>> This seems to be what Swift is doing, where Strings are not composed >>> not of codepoints but of graphemes. >>> >>>>>> "Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence** of one or more Unicode scalars that (when combined) produce a single human-readable character. [1] >>> >>> ** i.e. not an array >>> >>>>>> Here’s an example. The letter é can be represented as the single Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same letter can also be represented as a pair of scalars—a standard letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT scalar (U+0301). TheCOMBINING ACUTE ACCENT scalar is graphically applied to the scalar that precedes it, turning an e into an éwhen it is rendered by a Unicode-aware text-rendering system. [1] >>> >>>>>> In both cases, the letter é is represented as a single Swift Character value that represents an extended grapheme cluster. In the first case, the cluster contains a single scalar; in the second case, it is a cluster of two scalars:" [1] >>> >>>>>> Swiftʼs string implemenation makes working with Unicode easier and significantly less error-prone. As a programmer, you still have to be aware of possible edge cases, but this probably cannot be avoided completely considering the characteristics of Unicode. [2] >>> >>> Indeed I've tried searched for what problems it causes and get a null >>> result. So I read *all*good* things about Swift's unicode >>> implementation reducing common errors dealing with Unicode. Can >>> anyone point to complaints about Swift's unicode implementation? >>> Maybe this... >>> >>>>>> An argument could be made that the implementation of String as a sequence that requires iterating over characters from the beginning of the string for many operations poses a significant performance problem but I do not think so. My guess is that Appleʼs engineers have considered the implications of their implementation and apps that do not deal with enormous amounts of text will be fine. Moreover, the idea that you could get away with an implementation that supports random access of characters is an illusion given the complexity of Unicode. [2] >>> >>> Considering our common pattern: Make it work, Make it right, Make it >>> fast -- maybe Strings as arrays are a premature optimisation, that >>> was the right choice in the past prior to Unicode, but considering >>> Moore's Law versus programmer time, is not the best choice now. >>> Should we at least start with a UnicodeString and UnicodeCharacter >>> that operates like Swift, and over time *maybe* move the tools to use >>> them. 
>>> >>> [1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html >>> [2] http://oleb.net/blog/2014/07/swift-strings/ >>> >>> cheers -ben >>> >>>> >>>>> >>>>> This is the key difference between Unicode and most character encodings. >>>>> >>>>> A codepoint does not always represent a whole character. >>>>> >>>>> On 7 December 2015 at 13:06, Henrik Johansen > > _,,,^..^,,,_ (phone) |
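To make the representation trade-off Eliot describes above concrete, here is a minimal Pharo/Squeak-style sketch of the two shapes being compared. The class names are placeholders and the class-definition template shown is Squeak's (Pharo's differs slightly); WideString in the current image already behaves essentially like the word-array version.

"Pointer shape: every slot holds a tagged Character object, so the GC must scan each instance."
ArrayedCollection variableSubclass: #PointerCodeString
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Unicode-Sketch'.

"Bits shape: raw 32-bit code units, one per code point; the GC skips them and the FFI can pass them as-is."
ArrayedCollection variableWordSubclass: #WordCodeString
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Unicode-Sketch'.

WordCodeString >> at: index
    "Answer the Character whose code point sits in the 32-bit slot at index."
    ^ Character value: (self basicAt: index)

WordCodeString >> at: index put: aCharacter
    "Only an integer code point fits in the slot, so arbitrary objects cannot be stored."
    self basicAt: index put: aCharacter asInteger.
    ^ aCharacter

Clients see the same at:/at:put: protocol either way; what differs is space, scanning cost, and what can end up in a slot.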
In reply to this post by EuanM
Eliot, what's your take on having heterogeneous collections for the
composed Unicode? i.e. collections with one element for each character, with some characters being themselves a collection of characters. (A simple character like "a" is one character; a character which is a collection of characters is, for example, the fully composed version of Ǖ (01d5): a U (0055) with a diaeresis ¨ (00a8, aka 0308 in combining form) on top, forming the compatibility character Ü (00dc), which then gets a macron ̄ (0304) on top of that. So a = #(0061), and Ǖ = #(01d5) = #( 00dc 0304) = #( 0055 0308 0304).) i.e. a string which alternated those two characters, 'aǕaǕaǕaǕ', would be represented by something equivalent to: #( 0061 #( 0055 0308 0304) 0061 #( 0055 0308 0304) 0061 #( 0055 0308 0304) 0061 #( 0055 0308 0304) ) as opposed to a string of compatibility characters: #( 0061 01d5 0061 01d5 0061 01d5 0061 01d5). Does alternating the type used for characters in a string have a significant effect on speed? On 11 December 2015 at 23:08, Eliot Miranda <[hidden email]> wrote: > Hi Todd, > > On Dec 11, 2015, at 12:57 PM, Todd Blanchard <[hidden email]> wrote: > > > On Dec 11, 2015, at 12:19, EuanM <[hidden email]> wrote: > > "If it hasn't already been said, please do not conflate Unicode and > UTF-8. I think that would be a recipe for > a high P.I.T.A. factor." --Richard Sargent > > > Well, yes. But I think you guys are making this way too hard. > > A unicode character is an abstract idea - for instance the letter 'a'. > The letter 'a' has a code point - it's the number 97. How the number 97 is > represented in the computer is irrelevant. > > Now we get to transfer encodings. These are UTF8, UTF16, etc.... A > transfer encoding specifies the binary representation of the sequence of > code points. > > UTF8 is a variable length byte encoding. You read it one byte at a time, > aggregating byte sequences to 'code points'. ByteArray would be an > excellent choice as a superclass but it must be understood that #at: or > #at:put: refers to a byte, not a character. If you want characters, you have > to start at the beginning and process it sequentially, like a stream (if > working in the ASCII domain - you can generally 'cheat' this a bit). A C > representation would be char utf8[] > > UTF16 is also a variable length encoding of two byte quantities - what C > used to call a 'short int'. You process it in two byte chunks instead of > one byte chunks. Like UTF8, you must read it sequentially to interpret the > characters. #at: and #at:put: would necessarily refer to byte pairs and not > characters. A C representation would be short utf16[]; It would also be > 50% space inefficient for ASCII - which is normally the bulk of your text. > > Realistically, you need exactly one in-memory format and stream > readers/writers that can convert (these are typically table driven state > machines). My choice would be UTF8 for the internal memory format and the > ability to read and write from UTF8 to UTF16. > > But I stress again...strings don't really need indexability as much as you > think and neither UTF8 nor UTF16 provides this property anyhow as they are > variable length encodings. I don't see any sensible reason to have more > than one in-memory binary format in the image. > > > The only reasons are space and time. If a string only contains code points > in the range 0-255 there's no point in squandering 4 bytes per code point > (same goes for 0-65535). Further, if in some application interchange is > more important than random access it may make sense on performance grounds > to use utf-8 directly. 
> > Again, Smalltalk's dynamic typing makes it easy to have one's cake and eat > it too. > > My $0.02c > > > _,,,^..^,,,_ (phone) > > > I agree. :-) > > Regarding UTF-16, I just want to be able to export to, and receive > from, Windows (and any other platforms using UTF-16 as their native > character representation). > > Windows will always be able to accept UTF-16. All Windows apps *might > well* export UTF-16. There may be other platforms which use UTF-16 as > their native format. I'd just like to be able to cope with those > situations. Nothing more. > > All this requires is a Utf16String class that has an asUtf8String > method (and any other required conversion methods). > > |
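As a concrete illustration of Todd's "read it one byte at a time, aggregating byte sequences to 'code points'" step, here is a hedged decoding sketch. Utf8Sketch is a hypothetical class, validation of continuation bytes, overlong forms and surrogates is omitted, and this is not presented as the image's real converter.

Utf8Sketch class >> decode: aByteArray
    "Answer an Array of Unicode code points decoded from UTF-8 bytes."
    | in points byte codePoint extra |
    in := ReadStream on: aByteArray.
    points := OrderedCollection new.
    [ in atEnd ] whileFalse: [
        byte := in next.
        byte < 16r80
            ifTrue: [ codePoint := byte. extra := 0 ]
            ifFalse: [
                byte < 16rE0
                    ifTrue: [ codePoint := byte bitAnd: 16r1F. extra := 1 ]
                    ifFalse: [
                        byte < 16rF0
                            ifTrue: [ codePoint := byte bitAnd: 16r0F. extra := 2 ]
                            ifFalse: [ codePoint := byte bitAnd: 16r07. extra := 3 ] ] ].
        "Each continuation byte contributes its low six bits."
        extra timesRepeat: [
            codePoint := (codePoint bitShift: 6) bitOr: (in next bitAnd: 16r3F) ].
        points add: codePoint ].
    ^ points asArray

Running it over the two spellings of é discussed earlier shows why codepoint-wise equality is not enough: (Utf8Sketch decode: #[195 169]) answers #(233), the precomposed U+00E9, while (Utf8Sketch decode: #[101 204 129]) answers #(101 769), an e followed by the combining acute. Equivalence testing needs a normalization step on top of any decoder.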
Hi Euan,
On Fri, Dec 11, 2015 at 5:45 PM, EuanM <[hidden email]> wrote:
> Eliot, what's your take on having heterogeneous collections for the
> composed Unicode?

I'm not sure I'm understanding the question, but... I'm told by someone in the know that string concatenation is a big deal in certain applications, so providing tree-like representations for strings can be a win since concatenation is O(1) (allocate a new root and assign the two subtrees). It seems reasonable to have a rich library with several representations available with different trade-offs. But I'd let requirements drive design, not feature dreams.

> i.e. collections with one element for each character, with some
> characters being themselves a collection of characters

I honestly don't know. You've just gone well beyond my familiarity with the issues :-). I'm just a VM guy :-). But I will say that in cases like this, real applications and the profiler are your friends. Be guided by what you need now, not by what you think you'll need further down the road.
_,,,^..^,,,_ best, Eliot |
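For anyone unfamiliar with the tree-like representation Eliot mentions, here is a hedged rope sketch. RopeNode is hypothetical, leaves are ordinary Strings, and balancing, at:, and copying are all left out; it only shows why concatenation can be O(1).

Object subclass: #RopeNode
    instanceVariableNames: 'left right size'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Unicode-Sketch'.

RopeNode class >> left: leftPart right: rightPart
    "Either part may be a String or another RopeNode."
    ^ self new setLeft: leftPart right: rightPart

RopeNode >> setLeft: leftPart right: rightPart
    left := leftPart.
    right := rightPart.
    size := leftPart size + rightPart size

RopeNode >> size
    ^ size

RopeNode >> , aStringOrRope
    "O(1) concatenation: no characters are copied, only a new root is allocated."
    ^ self class left: self right: aStringOrRope

RopeNode >> do: aBlock
    "In-order traversal visits the characters left to right; Strings already answer do:."
    left do: aBlock.
    right do: aBlock

So (RopeNode left: 'Hello, ' right: 'world') , '!' answers a rope of size 13 without copying either string; whether that trade-off pays off is exactly what profiling real applications should decide.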
Eliot Miranda:
> But I'd let requirements drive design, not feature dreams. ... > Be guided by what you need now, not by what you think you'll need further > down the road. These two sentences should be written in human sized gold letters in your bathroom, people ! Wisdom ! I mean, +1 Stef |
Hi Guys,
+0.95 I certainly agree we shouldn't write code now that we may need in the future but it doesn't hurt to look a little ahead and plan a bit. If the people defining database tables thought a little ahead and didn't try to save two characters per date by leaving off the century there never would have been a Y2K scare. Lou On Sat, 12 Dec 2015 10:09:24 +0100, Stéphane Rollandin <[hidden email]> wrote: >Eliot Miranda: > >> But I'd let requirements drive design, not feature dreams. > >... > >> Be guided by what you need now, not by what you think you'll need further >> down the road. > > >These two sentences should be written in human sized gold letters in >your bathroom, people ! Wisdom ! > >I mean, +1 > > >Stef > Louis LaBrunda Keystone Software Corp. SkypeMe callto://PhotonDemon |
Hi Louis,
_,,,^..^,,,_ (phone) > On Dec 12, 2015, at 5:19 AM, Louis LaBrunda <[hidden email]> wrote: > > Hi Guys, > > +0.95 I certainly agree we shouldn't write code now that we may need in > the future but it doesn't hurt to look a little ahead and plan a bit. If > the people defining database tables thought a little ahead and didn't try > to save two characters per date by leaving off the century there never > would have been a Y2K scare. > > Lou I am /not/ saying one shouldn't look forward; I am /not/ saying cut corners or do weak design. I am saying don't implement what you're not necessarily going to use. The y2k bug is either bad design or a space optimisation, depending on your point of view, but it isn't an example of requirements driving design. The requirement is to represent dates; the step to use 2 digits for the year is a subsequent step. Here's perhaps a better example. I've just spent two years working on Spur, the VM & object representation under Squeak 5 and Pharo 5. I designed in support for 64-bits from the start. Because I looked forward, this year the 64-bit version was relatively straight-forward to add. I /could/ have designed the system to support 128 bits on the assumption that some time 64-bits will be superseded by 128 bits. There was a requirement for 64 bits. There is no requirement for 128 bits. Had I aimed for 128 bits the system would be a mess; too big, slow, more complex to try and get it to work in 32- & 64-bits. Even further back I designed into the Cogit (the VM's JIT compiler) a split between the generic code generator and an object-representation-specific code generator that generates all code to do with object representation, such as testing for tagged types, fetching inst vars or allocating new objects, because I knew I wanted to replace the object representation as soon as possible. Had I not done so, adding Spur would have taken much more work; I would have had to engineer the split by refactoring instead of by design. So being driven by requirements doesn't mean not looking to the future, it means implementing what you need and what you /know/ you're going to need, not what "would be nice to have" or what "sounds like a really cool idea". _,,,^..^,,,_ (phone) > On Sat, 12 Dec 2015 10:09:24 +0100, Stéphane Rollandin > <[hidden email]> wrote: > >> Eliot Miranda: >> >>> But I'd let requirements drive design, not feature dreams. >> >> ... >> >>> Be guided by what you need now, not by what you think you'll need further >>> down the road. >> >> >> These two sentences should be written in human sized gold letters in >> your bathroom, people ! Wisdom ! >> >> I mean, +1 >> >> >> Stef > -- > Louis LaBrunda > Keystone Software Corp. > SkypeMe callto://PhotonDemon > > |
Hi Eliot,
>Hi Louis, >_,,,^..^,,,_ (phone) >> On Dec 12, 2015, at 5:19 AM, Louis LaBrunda <[hidden email]> wrote: >> >> Hi Guys, >> >> +0.95 I certainly agree we shouldn't write code now that we may need in >> the future but it doesn't hurt to look a little ahead and plan a bit. If >> the people defining database tables thought a little ahead and didn't try >> to save two characters per date by leaving off the century there never >> would have been a Y2K scare. >> >> Lou > >I am /not/ saying one shouldn't look forward; I am /not/ saying cut corners or do weak design. I'm sure you're not. And I didn't mean to imply otherwise. And I don't disagree with anything you've said. I just wanted to be sure everyone understood there is a difference between not implementing what isn't needed now and looking ahead enough to allow easy changes in the future. Your example designing in support for 64-bits from the start is spot on. >I am saying don't implement what you're not necessarily going to use. >The y2k bug is either bad design or a space optimisation, depending on your point of view, but it isn't an example of requirements driving design. The requirement is to represent dates; the step to use 2 digits for the year is a subsequent step. Just a little history because I'm old enough to remember when disk space was limited and at a premium (I think IBM 3330 were about 200MB). The choice to leave off the century (back then dates and times were often stored as strings) was a space saving effort. It was a bad choice, if it was a conscious one. I'm not sure it was a conscious choice, as many programmers didn't look very far ahead even to see the approaching century. >Here's perhaps a better example. I've just spent two years working on Spur, the VM & object representation under Squeak 5 and Pharo 5. I designed in support for 64-bits from the start. Because I looked forward, this year the 64-bit version was relatively straight-forward to add. I /could/ have designed the system to support 128 bits on the assumption that some time 64-bits will be superseded by 128 bits. There was a requirement for 64 bits. There is no requirement for 128 bits. Had I aimed for 128 bits the system would be a mess; too big, slow, more complex to try and get it to work in 32- & 64-bits. > >Even further back I designed into the Cogit (the VM's JIT compiler) a split between the generic code generator and an object-representation-specific code generator that generates all code to do with object representation, such as testing for tagged types, fetching inst vars or allocating new objects, because I knew I wanted to replace the object representation as soon as possible. Had I not done so, adding Spur would have taken much more work; I would have had to engineer the split by refactoring instead of by design. > >So being driven by requirements doesn't mean not looking to the future, it means implementing what you need and what you /know/ you're going to need, not what "would be nice to have" or what "sounds like a really cool idea". > >_,,,^..^,,,_ (phone) > >> On Sat, 12 Dec 2015 10:09:24 +0100, Stéphane Rollandin >> <[hidden email]> wrote: >> >>> Eliot Miranda: >>> >>>> But I'd let requirements drive design, not feature dreams. >>> >>> ... >>> >>>> Be guided by what you need now, not by what you think you'll need further >>>> down the road. >>> >>> >>> These two sentences should be written in human sized gold letters in >>> your bathroom, people ! Wisdom !
>>> >>> I mean, +1 >>> >>> >>> Stef >> -- >> Louis LaBrunda >> Keystone Software Corp. >> SkypeMe callto://PhotonDemon >> >> > Louis LaBrunda Keystone Software Corp. SkypeMe callto://PhotonDemon |
On Sun, Dec 13, 2015 at 3:57 AM, Louis LaBrunda
<[hidden email]> wrote: > Hi Eliot, > >>Hi Louis, >>_,,,^..^,,,_ (phone) >>> On Dec 12, 2015, at 5:19 AM, Louis LaBrunda <[hidden email]> wrote: >>> >>> Hi Guys, >>> >>> +0.95 I certainly agree we shouldn't write code now that we may need in >>> the future but it doesn't hurt to look a little ahead and plan a bit. If >>> the people defining database tables thought a little ahead and didn't try >>> to save two characters per date by leaving off the century there never >>> would have been a Y2K scare. >>> >>> Lou >> >>I am /not/ saying one shouldn't look forward; I am /not/ saying cut corners or do weak design. > > I'm sure you not. And I didn't mean to imply otherwise. And I don't > disagree with anything you've said. I just wanted to be sure everyone > understood there is a difference between not implementing what isn't needed > now and looking ahead enough to allow easy changes in the future. Your > example designing in support for 64-bits from the start is spot on. > >>I am saying don't implement what you're not necessarily going to use. > >>The y2k bug is either bad design or a space optimisation, depending on your point of view, but it isn't an example of requirements driving design. The requirement is to represent dates; the step to use 2 digits for the year is a subsequent step. > > Just a little history because I'm old enough to remember when disk space > was limited and at a premium (I think IBM 3330 were about 200MB). The > choice to leave off the century (back then dates and times were often > stored as strings) was a space saving effort. It was a bad choice, if it > was a conscious one. I'm not sure it was a conscious choice, as many > programmers didn't look very far ahead even to see the approaching century. Obligatory... http://www.businessinsider.com.au/picture-of-ibm-hard-drive-on-airplane-2014-1 now get off my lawn... ;) cheers -ben |