Re: [squeak-dev] Re: [Pharo-dev] Unicode Support

Re: [squeak-dev] Re: [Pharo-dev] Unicode Support

Eliot Miranda-2

Hi Euan,

> On Dec 10, 2015, at 6:43 PM, EuanM <[hidden email]> wrote:
>
> I agree with all of that, Ben.
>
> I'm currently fairly certain that fully-composed abstract characters
> is a term that is 1:1 mapped with the term "grapheme cluster"  (i.e.
> one is an older Unicode description of a newer Unicode term).
>
> And once we create these, I think this sort of implementation is
> straightforward.  For particular values of "straightforward", of
> course :-)
>
> i.e. the Swift approach is equivalent to the approach I originally
> proposed and asked for critiques of.
>
> One thing I don't understand....  why does the fact the composed
> abstract character (aka grapheme cluster) is a sequence mean that an
> array cannot be used to hold the sequence?

Of course an Array can be used, but one good reason to use bits organized as four-byte units is that the garbage collector spends no time scanning them, whereas as far as it's concerned the Array representation is all objects and must be scanned.  Another reason is that foreign code may find the bits representation compatible, so it can be passed through the FFI to other languages, whereas the Array of tagged characters will always require conversion.  Yet another reason is that in 64-bit images the Array takes twice the space of the bits object.
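
For concreteness, a minimal sketch of the bits representation, assuming a Squeak-style class definition (the class name CodePointString is hypothetical): instances are raw 32-bit words the garbage collector never scans, and at:/at:put: box and unbox Characters at the boundary.

ArrayedCollection variableWordSubclass: #CodePointString
	instanceVariableNames: ''
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Unicode-Sketch'

CodePointString >> at: index
	"Answer the Character whose code point is stored at index."
	^ Character value: (self basicAt: index)

CodePointString >> at: index put: aCharacter
	"Store aCharacter's code point; basicAt:put: accepts only an integer
	 that fits in 32 bits, so the type check comes for free."
	self basicAt: index put: aCharacter value.
	^ aCharacter

The Array-based alternative is just (Array new: n) holding Character objects directly, which is exactly what the garbage collector then has to scan.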

> If people then also want a compatibility-codepoints-only UTF-8
> representation, it is simple to provide comparable (i.e
> equivalence-testable) versions of any UTF-8 string - because we are
> creating them from composed forms by a *single* defined method.
>
> For my part, the reason I think we ought to implement it *in*
> Smalltalk is ...  this is the String class of the new age.  I want
> Smalltalk to handle Strings as native objects.

There's little if any difference in convenience of use between an Array of characters and a bits array with the string at:/at:put: primitives, since both require at:/at:put: to access.  But the latter is (efficiently) type-checked by the VM, whereas there's nothing to prevent storing anything other than characters in the Array unless one introduces the overhead of explicit type checks in Smalltalk, and the Array starts life as a sequence of nils (invalid until every element is set to a character) whereas the bits representation begins fully initialized with 0 asCharacter.  So there's nothing more "natively objecty" about the Array.  Smalltalk objects hide their representation from clients, and externally the two behave the same, except for space and time.

Given that this is a dynamically-typed language there's nothing to prevent one providing both implementations beyond maintenance cost and complexity/confusion.  So at least it's easy to do performance comparisons between the two.   But I still think the bits representation is superior if what you want is a sequence of Characters.
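
A rough way to compare the two, assuming the hypothetical CodePointString sketched above (absolute numbers will of course vary by VM and image):

| bits boxed bitsTime boxedTime |
bits := CodePointString new: 1000000.
boxed := Array new: 1000000.
bitsTime := [1 to: 1000000 do: [:i | bits at: i put: $a]] timeToRun.
boxedTime := [1 to: 1000000 do: [:i | boxed at: i put: $a]] timeToRun.
{ bitsTime. boxedTime }

Afterwards both answer $a from at:; externally they behave the same, and what's left to measure is the space and scanning cost described above.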

>> On 10 December 2015 at 23:41, Ben Coman <[hidden email]> wrote:
>> On Wed, Dec 9, 2015 at 5:35 PM, Guillermo Polito
>> <[hidden email]> wrote:
>>>
>>>> On 8 dic 2015, at 10:07 p.m., EuanM <[hidden email]> wrote:
>>>>
>>>> "No. a codepoint is the numerical value assigned to a character. An
>>>> "encoded character" is the way a codepoint is represented in bytes
>>>> using a given encoding."
>>>>
>>>> No.
>>>>
>>>> A codepoint may represent a component part of an abstract character,
>>>> or may represent an abstract character, or it may do both (but not
>>>> always at the same time).
>>>>
>>>> Codepoints represent a single encoding of a single concept.
>>>>
>>>> Sometimes that concept represents a whole abstract character.
>>>> Sometimes it represent part of an abstract character.
>>>
>>> Well. I do not agree with this. I agree with the quote.
>>>
>>> Can you explain a bit more about what you mean by abstract character and concept?
>>
>> This seems to be what Swift is doing, where Strings are not composed
>> not of codepoints but of graphemes.
>>
>>>>> "Every instance of Swift’s Character type represents a single extended grapheme cluster. An extended grapheme cluster is a sequence** of one or more Unicode scalars that (when combined) produce a single human-readable character. [1]
>>
>> ** i.e. not an array
>>
>>>>> Here’s an example. The letter é can be represented as the single Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same letter can also be represented as a pair of scalars—a standard letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically applied to the scalar that precedes it, turning an e into an é when it is rendered by a Unicode-aware text-rendering system. [1]
>>
>>>>> In both cases, the letter é is represented as a single Swift Character value that represents an extended grapheme cluster. In the first case, the cluster contains a single scalar; in the second case, it is a cluster of two scalars:" [1]
>>
>>>>> Swiftʼs string implementation makes working with Unicode easier and significantly less error-prone. As a programmer, you still have to be aware of possible edge cases, but this probably cannot be avoided completely considering the characteristics of Unicode. [2]
>>
>> Indeed I've tried searching for what problems it causes and got a null
>> result.  So I read  *all*good*  things about Swift's unicode
>> implementation reducing common errors dealing with Unicode.  Can
>> anyone point to complaints about Swift's unicode implementation?
>> Maybe this...
>>
>>>>> An argument could be made that the implementation of String as a sequence that requires iterating over characters from the beginning of the string for many operations poses a significant performance problem but I do not think so. My guess is that Appleʼs engineers have considered the implications of their implementation and apps that do not deal with enormous amounts of text will be fine. Moreover, the idea that you could get away with an implementation that supports random access of characters is an illusion given the complexity of Unicode. [2]
>>
>> Considering our common pattern: Make it work, Make it right, Make it
>> fast -- maybe Strings as arrays are a premature optimisation that was
>> the right choice in the past, prior to Unicode, but considering
>> Moore's Law versus programmer time is not the best choice now.
>> Should we at least start with a UnicodeString and UnicodeCharacter
>> that operate like Swift, and over time *maybe* move the tools to use
>> them?
>>
>> [1] https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
>> [2] http://oleb.net/blog/2014/07/swift-strings/
>>
>> cheers -ben
>>
>>>
>>>>
>>>> This is the key difference between Unicode and most character encodings.
>>>>
>>>> A codepoint does not always represent a whole character.
>>>>
>>>> On 7 December 2015 at 13:06, Henrik Johansen

_,,,^..^,,,_ (phone)

Re: [squeak-dev] Re: [Pharo-dev] Unicode Support

KenDickey
 
> > On Dec 10, 2015, at 6:43 PM, EuanM <[hidden email]> wrote:
>...
> > One thing I don't understand....  why does the fact the composed
> > abstract character (aka grapheme cluster) is a sequence mean that an
> > array cannot be used to hold the sequence?


Sorry, I missed the start of this discussion, so I may be _way_ off base here, but I wanted to suggest an alternative representation.

An array of binary bytes could hold the Unicode.  No GC scans needed.  An intermediate map (easily compacted) could note the grapheme clusters so that one would get O(1) access to Unicode characters.  In the trivial case, the map is direct.

This would allow UTF8, UTF16, UTF32, whatever, at the binary level, and it would handle grapheme clusters when accessing a composed "character".

This does not solve the "wide char replacing narrow char" problem, but a ropes-like solution would work here.  I.e. the binary-bytes vec is immutable and at:put: just adds new chars to the map layer.  "Copying the string" could yield a new binary-vec with the characters inserted.
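
A minimal sketch of that two-layer idea (all names hypothetical; how the grapheme-cluster boundaries get computed is elided).  bytes holds the encoded text and is never scanned by the GC; clusterStarts maps a cluster index to the byte offset where that cluster begins, giving O(1) access to composed "characters":

Object subclass: #GraphemeMappedString
	instanceVariableNames: 'bytes clusterStarts'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Unicode-Sketch'

GraphemeMappedString >> graphemeAt: index
	"Answer the bytes making up the index-th grapheme cluster."
	| start end |
	start := clusterStarts at: index.
	end := index = clusterStarts size
		ifTrue: [bytes size]
		ifFalse: [(clusterStarts at: index + 1) - 1].
	^ bytes copyFrom: start to: end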

$0.02,
-KenD

Re: [squeak-dev] Re: [Pharo-dev] Unicode Support

Colin Putney-3
In reply to this post by Eliot Miranda-2
 


On Fri, Dec 11, 2015 at 1:29 AM, Eliot Miranda <[hidden email]> wrote:
> For my part, the reason I think we ought to implement it *in*
> Smalltalk is ...  this is the String class of the new age.  I want
> Smalltalk to handle Strings as native objects.

There's little if any difference in convenience of use between an Array of characters and a bits array with the string at:/at:put: primitives, since both require at:/at:put: to access.  But the latter is (efficiently) type-checked by the VM, whereas there's nothing to prevent storing anything other than characters in the Array unless one introduces the overhead of explicit type checks in Smalltalk, and the Array starts life as a sequence of nils (invalid until every element is set to a character) whereas the bits representation begins fully initialized with 0 asCharacter.  So there's nothing more "natively objecty" about the Array.  Smalltalk objects hide their representation from clients, and externally the two behave the same, except for space and time.

I think Euan was referring to the Gemstone strategy of storing the string content as bits, then calling into ICU (a C++ library for Unicode processing) to manipulate them. So he's saying sure, store the string data as bits, but write Smalltalk code to sort them, render on screen etc.
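
For a taste of what that looks like, here is a rough sketch of decoding one code point from UTF-8 bytes in plain Smalltalk, no ICU involved; it is the building block for doing comparison or collation in the image.  The receiver class and selector are hypothetical, and handling of malformed sequences is omitted:

Utf8Bytes >> codePointAt: byteIndex
	"Answer the code point whose UTF-8 encoding starts at byteIndex.
	 A code point takes 1 to 4 bytes; each continuation byte contributes
	 its low six bits."
	| b |
	b := self at: byteIndex.
	b < 16r80 ifTrue: [^ b].	"0xxxxxxx: ASCII"
	b < 16rE0 ifTrue: [		"110xxxxx 10xxxxxx"
		^ ((b bitAnd: 16r1F) bitShift: 6)
			+ ((self at: byteIndex + 1) bitAnd: 16r3F)].
	b < 16rF0 ifTrue: [		"1110xxxx 10xxxxxx 10xxxxxx"
		^ ((b bitAnd: 16r0F) bitShift: 12)
			+ (((self at: byteIndex + 1) bitAnd: 16r3F) bitShift: 6)
			+ ((self at: byteIndex + 2) bitAnd: 16r3F)].
	^ ((b bitAnd: 16r07) bitShift: 18)	"11110xxx plus three continuations"
		+ (((self at: byteIndex + 1) bitAnd: 16r3F) bitShift: 12)
		+ (((self at: byteIndex + 2) bitAnd: 16r3F) bitShift: 6)
		+ ((self at: byteIndex + 3) bitAnd: 16r3F)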

Colin

 

Re: [squeak-dev] Re: [Pharo-dev] Unicode Support

Eliot Miranda-2
 


On Fri, Dec 11, 2015 at 11:05 AM, Colin Putney <[hidden email]> wrote:


On Fri, Dec 11, 2015 at 1:29 AM, Eliot Miranda <[hidden email]> wrote:
> For my part, the reason I think we ought to implement it *in*
> Smalltalk is ...  this is the String class of the new age.  I want
> Smalltalk to handle Strings as native objects.

There's little if any difference in convenience of use between an Array of characters and a bits array with the string at:/at:put: primitives, since both require at:/at:put: to access.  But the latter is (efficiently) type-checked by the VM, whereas there's nothing to prevent storing anything other than characters in the Array unless one introduces the overhead of explicit type checks in Smalltalk, and the Array starts life as a sequence of nils (invalid until every element is set to a character) whereas the bits representation begins fully initialized with 0 asCharacter.  So there's nothing more "natively objecty" about the Array.  Smalltalk objects hide their representation from clients, and externally the two behave the same, except for space and time.

I think Euan was referring to the Gemstone strategy of storing the string content as bits, then calling into ICU (a C++ library for Unicode processing) to manipulate them. So he's saying sure, store the string data as bits, but write Smalltalk code to sort them, render on screen etc.

Ah, thanks for the clarification.  Unwise of me to read complex emails on the phone :-).  Letter box problem.   I agree.  ICU should be steered clear of at all costs.


Colin

 






--
_,,,^..^,,,_
best, Eliot

Re: [squeak-dev] Re: [Pharo-dev] Unicode Support

Eliot Miranda-2
In reply to this post by Eliot Miranda-2
 
Hi Euan,

On Fri, Dec 11, 2015 at 5:45 PM, EuanM <[hidden email]> wrote:
Eliot, what's your take on having heterogeneous collections for the
composed Unicode?

I'm not sure I'm understanding the question, but... I'm told by someone in the know that string concatenation is a big deal in certain applications, so providing tree-like representations for strings can be a win since concatenation is O(1) (allocate a new root and assign the two subtrees).  It seems reasonable to have a rich library with several representations available with different trade-offs.  But I'd let requirements drive design, not feature dreams.
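
A minimal sketch of such a tree (rope) representation, with all names hypothetical and balancing, at:, and flattening left out; the point is just that #, allocates one node and copies nothing:

Object subclass: #RopeNode
	instanceVariableNames: 'left right size'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Unicode-Sketch'

RopeNode >> setLeft: leftPart right: rightPart
	left := leftPart.
	right := rightPart.
	size := leftPart size + rightPart size

RopeNode >> size
	^ size

RopeNode >> , aStringOrRope
	"O(1) concatenation: answer a new root over the two pieces; copy nothing."
	^ RopeNode new setLeft: self right: aStringOrRope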
 
i.e. collections with one element for each character, with some
characters being themselves a collection of characters

(A simple character like "a" is one char, and a character which is itself
a collection of characters is the fully composed version of Ǖ (01d5): a
U (0055) with a diaeresis ¨ (00a8, aka 0308 in combining form) on top,
forming the compatibility character Ü (00dc), which then gets a
macron ̄ (0304) on top of that.)

so
a  #(0061)

Ǖ  #(01d5) = #( 00dc 0304) = #( 0055 0308 0304)

i.e a string which alternated those two characters

'aǕaǕaǕaǕ'

would be represented by something equivalent to:

#( 0061 #( 0055 0308 0304)  0061 #( 0055 0308 0304)
   0061 #( 0055 0308 0304)  0061 #( 0055 0308 0304) )

as opposed to a string of compatibility characters:
#( 0061 01d5  0061 01d5 0061 01d5 0061 01d5)

Does alternating the type used for characters in a string have a
significant effect on speed?

I honestly don't know.  You've just gone well beyond my familiarity with the issues :-).  I'm just a VM guy :-). But I will say that in cases like this, real applications and the profiler are your friends.  Be guided by what you need now, not by what you think you'll need further down the road.
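
For what it's worth, the alternating representation above can be written down directly with brace syntax (hex code points need the 16r prefix in Smalltalk) and then handed to MessageTally, or whichever profiler your image has; the names and the workload here are only illustrative:

| a uComposed mixed |
a := Character value: 16r61.
uComposed := { Character value: 16r55.
	Character value: 16r308.
	Character value: 16r304 }.
mixed := { a. uComposed. a. uComposed. a. uComposed. a. uComposed }.
MessageTally spyOn:
	[1000000 timesRepeat:
		[mixed inject: 0 into: [:sum :each |
			sum + (each isCharacter ifTrue: [1] ifFalse: [each size])]]].

Whether the per-element type test costs anything noticeable is exactly the kind of question the profiler answers better than guessing.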


On 11 December 2015 at 23:08, Eliot Miranda <[hidden email]> wrote:
> Hi Todd,
>
> On Dec 11, 2015, at 12:57 PM, Todd Blanchard <[hidden email]> wrote:
>
>
> On Dec 11, 2015, at 12:19, EuanM <[hidden email]> wrote:
>
> "If it hasn't already been said, please do not conflate Unicode and
> UTF-8. I think that would be a recipe for
> a high P.I.T.A. factor."  --Richard Sargent
>
>
> Well, yes. But  I think you guys are making this way too hard.
>
> A unicode character is an abstract idea - for instance the letter 'a'.
> The letter 'a' has a code point - it's the number 97.  How the number 97 is
> represented in the computer is irrelevant.
>
> Now we get to transfer encodings.  These are UTF8, UTF16, etc....  A
> transfer encoding specifies the binary representation of the sequence of
> code points.
>
> UTF8 is a variable length byte encoding.  You read it one byte at a time,
> aggregating byte sequences to 'code points'.  ByteArray would be an
> excellent choice as a superclass but it must be understood that #at: or
> #at:put: refers to a byte, not a character.  If you want characters, you have
> to start at the beginning and process it sequentially, like a stream (if
> working in the ASCII domain - you can generally 'cheat' this a bit).  A C
> representation would be char utf8[]
>
> UTF16 is also a variable length encoding of two byte quantities - what C
> used to call a 'short int'.  You process it in two byte chunks instead of
> one byte chunks.  Like UTF8, you must read it sequentially to interpret the
> characters.  #at: and #at:put: would necessarily refer to byte pairs and not
> characters.  A C representation would be short utf16[];  It would also be
> 50% space inefficient for ASCII - which is normally the bulk of your text.
>
> Realistically, you need exactly one in-memory format and stream
> readers/writers that can convert (these are typically table driven state
> machines).  My choice would be UTF8 for the internal memory format and the
> ability to read and write from UTF8 to UTF16.
>
> But I stress again...strings don't really need indexability as much as you
> think and neither UTF8 nor UTF16 provide this property anyhow as they are
> variable length encodings.  I don't see any sensible reason to have more
> than one in-memory binary format in the image.
>
>
> The only reasons are space and time.  If a string only contains code points
> in the range 0-255 there's no point in squandering 4 bytes per code point
> (same goes for 0-65535).  Further, if in some application interchange is
> more important than random access it may make sense on performance grounds
> to use utf-8 directly.
>
> Again, Smalltalk's dynamic typing makes it easy to have one's cake and eat
> it too.
>
> My $0.02c
>
>
> _,,,^..^,,,_ (phone)
>
>
> I agree. :-)
>
> Regarding UTF-16, I just want to be able to export to, and receive
> from, Windows (and any other platforms using UTF-16 as their native
> character representation).
>
> Windows will always be able to accept UTF-16.  All Windows apps *might
> well* export UTF-16.  There may be other platforms which use UTF-16 as
> their native format.  I'd just like to be able to cope with those
> situations.  Nothing more.
>
> All this requires is a Utf16String class that has an asUtf8String
> method (and any other required conversion methods).
>
>




--
_,,,^..^,,,_
best, Eliot