Hi Folks,
I wasn't able to jump into the recent discussion earlier, but I've been thinking a bit about all this. This http://en.wikipedia.org/wiki/UTF8 made me realize that using UTF-8 internally to represent Unicode strings is a way to minimize required changes to existing software.

Hannes, your proposal for representing WideStrings as a variableWordSubclass is like http://en.wikipedia.org/wiki/UTF-32, right? And if I remember correctly, it is what Squeak uses.

I ended up sketching this to compare the alternatives:

ISO 8859-15
========
pros:
-------
- We already have a lot of specific primitives in the VM
- Efficient use of memory
- Very fast #at: and #at:put:
cons:
-------
- Can only represent Latin alphabets, not full Unicode

Unicode UTF-8 (in a new variableByteSubclass)
pros:
-------
- We could reuse many existing String primitives
- It was created to let existing code deal with Unicode with minimal changes
- Efficient use of memory
cons:
-------
- Does not allow #at:put:
- #at: is very slow: O(n) instead of O(1)

Unicode UTF-32 (in a new variableWordSubclass)
pros:
-------
- Very fast #at: and #at:put: (although I guess we don't use these much... especially #at:put:, as Strings are usually regarded as immutable)
cons:
--------
- Very inefficient use of memory (4 bytes for every character!)
- Doesn't take advantage of existing String primitives (sloooow)

I think that for some time, our main Character/String representation should be ISO 8859-15. Switching to full Unicode would require a lot of work in the paragraph layout and text rendering engines. But we can start building a Unicode representation that could be used for web apps.

So I suggest:

- Build a Utf8String variableByteSubclass and a Utf8Character. Try to reuse String primitives. Do not include operations such as #at:, to discourage their use.

- Make conversion to/from ISO 8859-15 lossless, by a good encoding of code points in regular Strings (not unlike Hannes' recent contribution).

- Use this for the Clipboard. This is pretty close to what we have now, but we'd allow using an external Unicode text editor to build any Unicode text; by copying and pasting it into a String literal in Cuis code, we'd have something that can be converted back into UTF-8 without losing any content.

- Web servers should convert Strings back into UTF-8. This would let us handle and serve content using full Unicode, without needing to wait until our tools can display it properly.

- Use this when editing external files, if they happen to be in UTF-8 (there are good heuristics for determining this). On save, offer the option to save as UTF-8 or ISO 8859-15.

- We can start adapting the environment to avoid the String protocols that are a problem for UTF-8 (i.e. #at:, #at:put: and related). This would ease an eventual migration to using only UTF-8 for everything.

What do you think? Do we have a plan?

Cheers,
Juan Vuletich
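P.S. To make the #at: cost concrete, here is a rough sketch of what indexing a UTF-8 string would have to do. Utf8String and #decodeCodePointStartingAt: are hypothetical names, not existing Cuis code:

  Utf8String >> codePointAt: index
    "Answer the code point of the index-th character.
    O(n): we must scan from the start, because each character
    occupies 1 to 4 bytes. #decodeCodePointStartingAt: is a
    hypothetical helper that decodes one character."
    | seen byteIndex byte |
    seen := 0.
    byteIndex := 1.
    [ byteIndex <= self basicSize ] whileTrue: [
      byte := self basicAt: byteIndex.
      "Continuation bytes look like 2r10xxxxxx; any other
      byte starts a new character."
      (byte bitAnd: 2r11000000) = 2r10000000 ifFalse: [
        seen := seen + 1.
        seen = index ifTrue: [
          ^ self decodeCodePointStartingAt: byteIndex ] ].
      byteIndex := byteIndex + 1 ].
    ^ self errorSubscriptBounds: index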
Hi Juan,
Please don't use UTF-8 encoded strings internally. Store Unicode strings as plain Unicode in the image, and provide encoding to/from UTF-8 "on the edge" of the image, that is, when interfacing with the outside world: file system, sockets, etc. All other Smalltalks have such a Unicode solution, and I think it is the simplest to achieve, especially because you preserve all the usual string manipulations, starting from the simplest, like determining the string size. Try that on a UTF-8 encoded string...

All you need for that is a ByteString for ASCII/ISO 8859-1 etc. (code points < 256) and a WideString for all other Unicode, and especially a flexible Character class which can handle both 8-bit and wider Unicode code points. It seems that the Character class is the most problematic part to achieve.

I once worked on a Unicode patch for Squeak (to introduce an interim TwoByteString, as in VW, for more efficient storage of Unicode strings in the image); maybe it can help you (and Hannes) a bit:

Unicode patch
http://forum.world.st/Unicode-patch-tt64011.html
as part of this very long discussion:
http://forum.world.st/New-Win32-VM-m17n-testers-needed-tp63730p63798.html

One more resource:

1) The Design and Implementation of Multilingualized Squeak
Yoshiki Ohshima, Kazuhiro Abe
http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/main.html

Best regards
Janko
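P.S. "Encoding on the edge" then reduces to plain arithmetic at the I/O boundary. A minimal sketch of a code-point-to-UTF-8 converter (the selector is made up; only the bit twiddling matters):

  utf8BytesOf: codePoint
    "Answer a ByteArray with the standard UTF-8 encoding
    of codePoint (1 to 4 bytes)."
    codePoint < 16r80 ifTrue: [ ^ ByteArray with: codePoint ].
    codePoint < 16r800 ifTrue: [
      ^ ByteArray
        with: 16rC0 + (codePoint bitShift: -6)
        with: 16r80 + (codePoint bitAnd: 16r3F) ].
    codePoint < 16r10000 ifTrue: [
      ^ ByteArray
        with: 16rE0 + (codePoint bitShift: -12)
        with: 16r80 + ((codePoint bitShift: -6) bitAnd: 16r3F)
        with: 16r80 + (codePoint bitAnd: 16r3F) ].
    ^ ByteArray
      with: 16rF0 + (codePoint bitShift: -18)
      with: 16r80 + ((codePoint bitShift: -12) bitAnd: 16r3F)
      with: 16r80 + ((codePoint bitShift: -6) bitAnd: 16r3F)
      with: 16r80 + (codePoint bitAnd: 16r3F)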
In reply to this post by Juan Vuletich-4
Hi people!
I just found:
http://www.cprogramming.com/tutorial/unicode.html
where similar pros/cons were discussed, and
http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061(v=vs.85).aspx

Well, AFAIK, UTF-16 was the way adopted by C and others to represent Unicode strings in memory. According to the first article, a direct representation of any Unicode character needs 4 bytes, but UTF-16 is the encoding adopted to:
- avoid wasting space
- allow easy access to a character

I also found
http://www.evanjones.ca/unicode-in-c.html
"Contrary to popular belief, it is possible for a Unicode character to require multiple 16-bit values."
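(That quote is about surrogate pairs. The standard arithmetic, just as a sketch of what any UTF-16 implementation must handle:

  | cp v high low |
  cp := 16r1F600.            "a code point above U+FFFF"
  v := cp - 16r10000.        "20 bits remain"
  high := 16rD800 + (v bitShift: -10).   "high surrogate: 16rD83D"
  low := 16rDC00 + (v bitAnd: 16r3FF).   "low surrogate: 16rDE00"
  Array with: high with: low

So a single character can indeed take two 16-bit values.)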
"What format do you use for data that goes in and out of your software, and what format do you use internally?"

Then, the author points to another link:
http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
"Inside your software, store text as UTF-8 or UTF-16; that is to say, pick one of the two and stick with it."
Another interesting post by Tim Bray:
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

My guess:
- Cuis Smalltalk could support both internal representations, UTF-8 AND UTF-16. A WideString class could be written, and, I guess, a WideChar. Their protocols should be the same as those of their "byte" counterparts. What am I missing: what are the problems with this approach? All the problems should be confined to the internal implementation of WideString and WideChar, and to the points where the code has to decide:
- a new string is needed: what kind?
- I have a string object (String or WideString) and I need to encode it for storage outside memory: what encoding?

Yes, it could be a lot of work, but it's the way adopted by many languages, libraries, and technologies: a clear separation between internal representation and encoding. The only twist, Smalltalk-like, is the possibility of having TWO internal representations, using Smalltalk's capabilities.
The other way: bite the bullet; ONE internal representation, UTF-16, all encapsulated in String (I guess this is the strategy adopted by Java and .NET), and add new methods to indicate the encoding where needed (I guess in I/O and serialization), with a default encoding (UTF-8?).
I don't understand why web servers should convert Strings back into UTF-8. Is there no encoding for text responses in the HTTP protocol? I read
http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Specifying_the_document.27s_character_encoding
AFAIK, everything related to string I/O has its encoding specified in some way these days.

Angel "Java" Lopez
@ajlopez
In reply to this post by Janko Mivšek
Hi Janko,
On 2/7/2013 4:44 AM, Janko Mivšek wrote:
> Please don't use UTF-8 encoded strings internally. Store Unicode
> strings as plain Unicode in the image, and provide encoding to/from
> UTF-8 "on the edge" of the image, that is, when interfacing with the
> outside world: file system, sockets, etc.

There's no such thing as "plain Unicode". What you mean is UTF-32.

> All you need for that is a ByteString for ASCII/ISO 8859-1 etc.
> (code points < 256) and

(side note: Cuis already uses ISO 8859-15, which is quite a bit better than ISO 8859-1)

> a WideString for all other Unicode, and especially a flexible
> Character class which can handle both 8-bit and wider Unicode code
> points. It seems that the Character class is the most problematic
> part to achieve.

Do we need multiple String representations? Can just one suffice?

> I once worked on a Unicode patch for Squeak (to introduce an interim
> TwoByteString, as in VW, for more efficient storage of Unicode
> strings in the image); maybe it can help you (and Hannes) a bit.

Interesting. It shows that this can get complex.

Let me rephrase my thought. Right now, Cuis uses single-byte Chars/Strings. No matter what, we need to support UTF-8. So, is UTF-8 enough? Can we avoid adding a third encoding and implementation? Can we drop ISO 8859-15 some day (and use ONLY UTF-8 everywhere)?

In Cuis, we try to do the simplest thing. Having one single encoding for Chars and Strings would indeed be the simplest. Looking a bit at the links you sent, everybody agrees with me that the main drawbacks of UTF-8 are slow #at: and #size (thanks for mentioning #size), and impossible #at:put:. So, is this a serious problem? Maybe not, especially if we encourage stream-like processing rather than array-like access; and this is the main question I'd like to answer.

Cheers,
Juan Vuletich
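P.S. What I mean by stream-like processing: iterate and decode sequentially, never index. A sketch, again with a hypothetical Utf8String:

  Utf8String >> codePointsDo: aBlock
    "Evaluate aBlock with each decoded code point, in order.
    Sequential decoding is O(n) for the whole string, so no
    O(1) random access is needed."
    | i byte n cp |
    i := 1.
    [ i <= self basicSize ] whileTrue: [
      byte := self basicAt: i.
      "The lead byte determines the character's byte count n."
      n := byte < 16r80
        ifTrue: [ 1 ]
        ifFalse: [ byte < 16rE0
          ifTrue: [ 2 ]
          ifFalse: [ byte < 16rF0 ifTrue: [ 3 ] ifFalse: [ 4 ] ] ].
      "Keep the payload bits of the lead byte, then append
      6 bits from each continuation byte."
      cp := n = 1
        ifTrue: [ byte ]
        ifFalse: [ byte bitAnd: (16rFF bitShift: (n + 1) negated) ].
      2 to: n do: [ :k |
        cp := (cp bitShift: 6) + ((self basicAt: i + k - 1) bitAnd: 16r3F) ].
      aBlock value: cp.
      i := i + n ]

On top of #codePointsDo: one can build counting, copying, printing and encoding conversion, all without ever needing #at:.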
In reply to this post by Angel Java Lopez
Hi Angel,
On 2/7/2013 7:07 AM, Angel Java Lopez wrote:
> I just found: http://www.cprogramming.com/tutorial/unicode.html
> where similar pros/cons were discussed [...]

Thanks for the links.
> - a new string is needed: what kind?

One that can store Unicode. Candidates are UTF-8 and UTF-32.
> - I have a string object (String or WideString) and I need to encode
> it for storage outside memory: what encoding?

That's easy to answer: UTF-8.
> Yes, it could be a lot of work, but it's the way adopted by many
> languages, libraries, and technologies:

I don't mean to sound harsh, but we don't decide by statistics. If we did that, we would not be doing Smalltalk :) .
> The other way: bite the bullet; ONE internal representation, UTF-16,
> all encapsulated in String (I guess this is the strategy adopted by
> Java and .NET).

UTF-16? Why? I'd rather choose UTF-8 or UTF-32.
> I don't understand why web servers should convert Strings back into
> UTF-8. Is there no encoding for text responses in the HTTP protocol?

Serving text in UTF-8 allows using full Unicode web content, and minimizes compatibility risks.
Cheers, Juan Vuletich
Why UTF-16?
- Quick access: #at: and #at:put:; the most usual Unicode characters map directly to 2 bytes.
- If some string has an out-of-band character, it can easily be marked as special.
- This is (AFAIK) the preferred internal representation in .NET, Java, and most C++ implementations of wchar_t (please confirm). So, by current standards, memory is not a problem for string representations.
In practice, UTF-16 is plain Unicode in most cases.
Sorry, I should add:
See "X uses UTF-16". I guess there are more UTF-16 implementation than UTF-32, by two reasons: lion part of characters go to the plane 0, and space. If space is not the problem, go UTF-32.
Python 3.3 will have a mixed approach (PEP 393's flexible string representation) that could also be tackled in Smalltalk, as I depicted in my first email.
In reply to this post by Angel Java Lopez
On 2/7/2013 9:09 AM, Angel Java Lopez wrote:
> Why UTF-16?
>
> - Quick access: #at: and #at:put:; the most usual Unicode characters
> map directly to 2 bytes.

Not if you support full Unicode. We'd need to add a flag to each instance saying "I only use 2-byte chars", to avoid the cost of scanning the whole instance every time just to find out; and that would kill the "quick access" argument. In any case, as was just said in this thread, the main question to answer is "does this issue really matter at all?"
> - If some string has an out-of-band character, it can easily be
> marked as special.

That's when things start to get messy.
> - This is (AFAIK) the preferred internal representation in .NET,
> Java, and most C++ implementations of wchar_t (please confirm). So,
> by current standards, memory is not a problem for string
> representations.

I _don't_ care about the internal representation others use. I just care about which one is best.
Cheers, Juan Vuletich
So, forget full Unicode, and support only the first plane. Baby step. One internal representation, covering >9x% of cases. If it were my personal project, I would follow this path.
But if you want to go full Unicode, I don't see the scan problem: usually, you MUST scan the original string anyway, because it arrives encoded in some way. Once that's done, you know whether its UTF-16 form has 4-byte characters or not. You must then decide whether the internal representation goes UTF-16 marked as special, or plain Unicode. Something like the one-pass decision sketched below is what I mean.
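(A sketch; TwoByteString is hypothetical here, borrowed from VW, and the selector is made up:

  representationClassFor: codePoints
    "After decoding, pick the narrowest representation that
    can hold every code point found during the scan."
    | max |
    max := codePoints inject: 0 into: [ :m :cp | m max: cp ].
    max < 256 ifTrue: [ ^ ByteString ].
    max < 16r10000 ifTrue: [ ^ TwoByteString ].
    ^ WideString
)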
In reply to this post by Juan Vuletich-4
As I understand:
UTF-32 is effectively no encoding; that's what I'd call 'plain Unicode'. You need the full 32 bits for some Japanese characters, and because Yoshiki Ohshima is Japanese, that's why Squeak has a 32-bit WideString.

UTF-16 involves no encoding for alphabets like most East European Latin, Cyrillic, Greek, ..., but starts encoding for Japanese, Chinese, ... VW's 16-bit TwoByteString represents that.

UTF-8 involves no encoding for ASCII and a few ISO 8859 charsets, but starts encoding for other alphabets. An 8-bit ByteString is enough for those.

So, if you use plain ASCII and some ISO 8859 charsets, a ByteString is enough. As soon as at least one character with a code point >255 is added, the complete string is auto-converted: to TwoByteString in VW, to WideString in Squeak. That way you preserve all string manipulations (#at:put:, #size), at the cost of less efficient memory consumption.

In VW, Character is a subclass of Magnitude and seems to be 32-bit by default, so there is no problem supporting all Unicode characters/code points. In Squeak/Pharo, Character is also a subclass of Magnitude with an additional instvar 'value', which seems to be 32-bit as well? Note also the class vars CharacterTable, DigitValues ...

Best regards
Janko
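P.S. The auto-conversion can be as simple as this sketch (hypothetical for Cuis; VW and Squeak differ in the details, and 'WideString from:' is assumed here to copy the characters into the wider class):

  ByteString >> at: index put: aCharacter
    "If aCharacter doesn't fit in one byte, migrate the whole
    string to the wide representation, then retry. After
    #becomeForward: all references, including self, point to
    the new WideString."
    aCharacter asInteger < 256
      ifTrue: [ ^ super at: index put: aCharacter ].
    self becomeForward: (WideString from: self).
    ^ self at: index put: aCharacter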
In reply to this post by Juan Vuletich-4
On Thu, 07 Feb 2013 09:21:42 -0300
Juan Vuletich <[hidden email]> wrote:
> Not if you support full Unicode. We'd need to add a flag to each
> instance saying "I only use 2-byte chars", to avoid the cost of
> scanning the whole instance every time just to find out; and that
> would kill the "quick access" argument.

Actually, the quick thing to do is to have multiple representations and let Smalltalk do the (double) dispatching as required.

I would recommend that Unicode support be almost exclusively a separate library, with minimal changes to core. A small add-on change-set like I did for Color could be used to allow loading the Unicode library(s) with no changes to core and little impact on Juan's time until things look good. I know Juan's tendency to peek and comment, but it would not be required. ;^)

My suggestion is a new alternate reader for Unicode which scans the text and auto-converts input into one of 3 categories of internal representation: current (or UTF-8), UTF-16, or UTF-32, each of which is "pure", i.e. only one encoding size in each case. If I "squash" a Chinese character into a current/UTF-8 string, the result is a new UTF-32 string. So we give up mutation in favor of functional strings, but access is O(1).

This also means 3 classes of characters for 8, 16, or 32 bits, along with the cross-product of comparison, up/down-case and (much work in Unicode) sorting. I suggest we just sort on "code points" in the short term.

This has the advantage of testing simplicity. Doing multiple representations where characters are a mix of 8/16/32 bits gives too much work with edge cases in unit tests. Too much time would be spent "testing tests".

The string writer could output in current or UTF-8, with UTF-16 and UTF-32 added later.

Proper character display is addressed separately, only after Unicode support is implemented and properly working.

The implementation would not be optimal w.r.t. memory utilization, but would minimize development (including unit-test writing) time. Complex handling of mixed-size characters in strings could be done later as an optimization -- but only if/when this becomes useful.

There is an issue of mutable strings (e.g. in text editing). There are several ways of approaching this, including a "caching" implementation which converts all to UTF-32 for fast access and optionally back, and so forth. Again, this can be approached on an as-needed basis after full (but simple) Unicode support is in place at the Character and String level.

$0.02,
-KenD
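P.S. The double dispatch I have in mind, sketched for concatenation (all class and selector names hypothetical, and assuming a #codePointsDo: enumerator on both classes):

  Utf8String >> , aSequence
    "Let the argument decide how to combine with a UTF-8 receiver."
    ^ aSequence concatenatedAfterUtf8: self

  Utf32String >> concatenatedAfterUtf8: aUtf8String
    "Mixing widths: the widest representation wins, so the
    result is a pure UTF-32 string of code points."
    ^ Utf32String streamContents: [ :strm |
      aUtf8String codePointsDo: [ :cp | strm nextPut: cp ].
      self codePointsDo: [ :cp | strm nextPut: cp ] ]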
On 2/7/13, KenD <[hidden email]> wrote:
> Actually, the quick thing to do is to have multiple representations
> and let Smalltalk do the (double) dispatching as required.
>
> I would recommend that Unicode support be almost exclusively a
> separate library, with minimal changes to core.

+1 for a separate library, with minimal changes to core.

> A small add-on change-set like I did for Color could be used to allow
> loading the Unicode library(s) with no changes to core and little
> impact on Juan's time until things look good.

+1 for little impact on Juan's time, so that he can continue to focus on Morphic.
In reply to this post by KenDickey
On 2/7/2013 1:45 PM, KenD wrote:
> Actually, the quick thing to do is to have multiple representations
> and let Smalltalk do the (double) dispatching as required.
>
> I would recommend that Unicode support be almost exclusively a
> separate library, with minimal changes to core. A small add-on
> change-set like I did for Color could be used to allow loading the
> Unicode library(s) with no changes to core and little impact on
> Juan's time until things look good.
>
> I know Juan's tendency to peek and comment, but it would not be
> required. ;^)

:) I'll try to let you guys find the best approach.

Yesterday I played a bit to see how quickly the system would break if #at: and #at:put: were banned for Strings. It breaks rather quickly. So, converting everything to use only UTF-8, as I had suggested, is not a trivial task, and you are right that I'd better focus on Morphic 3...

Cheers,
Juan Vuletich