Hi all,

I would like to know how I can create a UTF-* character composed, for example, of the two bytes

  16rC3 and 16rBC

I tried

  WideString fromByteArray: { 16rC3 . 16rBC }

Stef
On Tue, 2008-09-23 at 10:46 +0200, stephane ducasse wrote:
> Hi all
>
> I would like to know how I can create a UTF-* character composed for
> example of two bytes
>
> 16rC3 and 16rBC
>
> I tried
>
> WideString fromByteArray: { 16rC3 . 16rBC }
>
> Stef

Hmm, I'm not sure what you mean by a UTF-* character, but this way it works:

  ((String fromByteArray: (ByteArray with: 16rC3 with: 16rBC))
      convertFromEncoding: #utf8) at: 1

And the result is not a two-byte character, because it is a character that is contained in Latin-1. I thought there would be an easier/better way to do this! Bert? :)

Norbert
In reply to this post by stephane ducasse
On 23.09.2008, at 01:46, stephane ducasse wrote:
> Hi all
>
> I would like to know how I can create a UTF-* character composed
> for example of two bytes
>
> 16rC3 and 16rBC
>
> I tried
>
> WideString fromByteArray: { 16rC3 . 16rBC }
>
> Stef

There is no such thing as a "UTF-*" character. There are Unicode Characters, and Unicode Strings, and there are UTF-encoded strings (UTF means Unicode Transformation Format).

All characters in Squeak use Unicode now. For example, the cyrillic Б is

  char := Character value: 16r0411.

This can be made into a String:

  wideString := String with: char.

which of course has the same Unicode code points:

  wideString asArray collect: [:each | each hex]

gives

  #('16r411')

The string can be encoded as UTF-8:

  utf8String := wideString squeakToUtf8.

and to see the values there,

  utf8String asArray collect: [:each | each hex]

yields

  #('16rD0' '16r91')

which is the UTF-8 representation of the character we began with (but if you try to print utf8String directly you get nonsense, because Squeak does not know it is UTF-8 encoded).

The decoding of UTF-8 to a String is similar:

  #(16rC3 16rBC) asByteArray asString utf8ToSqueak

which returns the String 'ü' and probably is what you wanted in the first place - but please try to understand and use the Unicode terms correctly to minimize confusion.

Anyway, to convert between a String in UTF-8 and a regular Squeak String, it's simplest to use utf8ToSqueak and squeakToUtf8.

- Bert -
On Tue, 2008-09-23 at 06:48 -0700, Bert Freudenberg wrote:
> [...]
>
> The decoding of UTF-8 to a String is similar:
>
>   #(16rC3 16rBC) asByteArray asString utf8ToSqueak
>

(and more of this "strange method stuff"[tm]).

> which returns the String 'ü' and probably is what you wanted in the
> first place - but please try to understand and use the Unicode terms
> correctly to minimize confusion.
>
> Anyway, to convert between a String in UTF-8 and a regular Squeak
> String, it's simplest to use utf8ToSqueak and squeakToUtf8.
>
> - Bert -

Norbert

P.S.: My only hope is that with my knowledge getting bigger and pharo's getting smaller we meet somewhere in between!!!
In reply to this post by stephane ducasse
On Tuesday 23 Sep 2008 2:16:43 pm stephane ducasse wrote:
> I would like to know how I can create a UTF-* character composed for
> example of two bytes
>
> 16rC3 and 16rBC
>
> I tried
>
> WideString fromByteArray: { 16rC3 . 16rBC }

  alphaBeta := WideString from: #(945 946).

gives me a Squeak wide string containing Greek alpha and beta. The numbers are from the Unicode BMP for Greek.

  alphaBeta squeakToUtf8 asByteArray

yields the UTF-8 sequence #(206 177 206 178), and

  #(206 177 206 178) asByteArray asString utf8ToSqueak

gives me back the original string. Of course, you should turn on the "usePangoRenderer" preference to see characters other than Latin-1 rendered correctly.

HTH .. Subbu
In reply to this post by Bert Freudenberg
2008/9/23 Bert Freudenberg <[hidden email]>:
> [...]
>
> Anyway, to convert between a String in UTF-8 and a regular Squeak
> String, it's simplest to use utf8ToSqueak and squeakToUtf8.

Am I the only one using the generic en/decoding functionality in Squeak in the form of #convertTo/FromEncoding?

Convert from "Squeak" to UTF-8:

  aString convertToEncoding: 'utf-8'

Convert from UTF-8 to "Squeak":

  aString convertFromEncoding: 'utf-8'

For checking out all the encodings your image supports:

  TextConverter allEncodingNames

Cheers
Philippe
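To make Philippe's round trip concrete, here is a small workspace sketch (a sketch only: it assumes the default UTF-8 TextConverter is registered under the name 'utf-8' in your image, and that #convertToEncoding: answers a String holding the encoded bytes, as #squeakToUtf8 does):

  | original encoded decoded |
  original := 'ü'.
  encoded := original convertToEncoding: 'utf-8'.    "a String whose characters are the UTF-8 bytes 16rC3 16rBC"
  decoded := encoded convertFromEncoding: 'utf-8'.   "back to the image-internal representation"
  decoded = original                                  "expected to answer true for a Latin-1 character"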
Is there a reason (other than history) why Strings are not collections of Unicode characters (at least as viewed from outside) rather than bytes in some unknown encoding (which should be encapsulated and only appear when text goes in and out of the image)? Or is it already like that?

On Tue, Sep 23, 2008 at 7:49 PM, Philippe Marschall <[hidden email]> wrote:
> Am I the only one using the generic en/decoding functionality in
> Squeak in the form of #convertTo/FromEncoding?
>
> Convert from "Squeak" to UTF-8
>   aString convertToEncoding: 'utf-8'
>
> Convert from UTF-8 to "Squeak"
>   aString convertFromEncoding: 'utf-8'
>
> For checking out all the encodings your image supports:
>   TextConverter allEncodingNames
>
> Cheers
> Philippe

--
Damien Pollet
type less, do more [ | ] http://people.untyped.org/damien.pollet
At Wed, 24 Sep 2008 10:49:18 +0200,
Damien Pollet wrote:
>
> Is there a reason (other than history) why Strings are not collections
> of Unicode characters (at least as viewed from outside) rather than
> bytes in some unknown encoding (which should be encapsulated and only
> appear when text goes in and out of the image)? Or is it already like
> that?

I think the answer is that it is already *like that*, although I can't tell what you mean by "from outside".

In the image, a ByteString or WideString is a sequence of characters that hold Unicode code points. (Note that a Unicode code point is 21-bit.) If all the codes in a string fit within 8 bits, we use ByteString; if they don't, it uses WideString, but the distinction is more or less hidden from a casual user. The conversion is only needed when the String is interfacing with the outside of the image.

A Unicode code point doesn't really correspond to the concept of a character, if you think of an accented character as a "character". The original concept of Unicode was that such a "character" should always be represented as a sequence of code points: one base character, and one or more accent marks. It was at least pure and fair. But they got the "Latin-1 compatibility" idea around 1990 in a retrofitted way, so the original idea of "Let us make a universal character set for everybody in the world" was turned into: "Let us make a universal character set for everybody in the world, but let's treat Westerners nicer." Of course this turn created the situation where a simple accented character has two (precomposed and decomposed) representations. Squeak is still way behind and prefers the precomposed "normalization", but the normalization is really lax there.

To me, the Han unification is another piece of evidence for the "Westerners first" idea. If tracing back to the origin of characters is the criterion, i and j should perhaps be unified as well (just kidding).

But Unicode is the standard now, and it does solve a lot of problems. So using it as the base, but putting the necessary information around it to support it, is a good way in principle. If so, one could argue that we can just hold every string in decomposed UTF-8 in the image, and have a couple of variants of at: and at:put:. The requirement of O(1) random access is not that important. I might go that direction if I redo it now.

-- Yoshiki
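To see the two representations Yoshiki mentions side by side, here is a small workspace sketch. The code points are the standard Unicode ones (U+00E9 LATIN SMALL LETTER E WITH ACUTE, and U+0065 plus U+0301 COMBINING ACUTE ACCENT); whether the two strings display identically depends on your fonts and renderer:

  | precomposed decomposed |
  precomposed := String with: (Character value: 16rE9).   "'é' as the single precomposed code point U+00E9"
  decomposed := WideString from: #(16r65 16r301).          "'e' U+0065 followed by the combining acute U+0301"
  precomposed = decomposed                                  "false - same visual character, different code point sequences"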
On 24-Sep-08, at 2:26 AM, Yoshiki Ohshima wrote:

> At Wed, 24 Sep 2008 10:49:18 +0200, Damien Pollet wrote:
>>
>> Is there a reason (other than history) why Strings are not collections
>> of Unicode characters (at least as viewed from outside) rather than
>> bytes in some unknown encoding (which should be encapsulated and only
>> appear when text goes in and out of the image)? Or is it already like
>> that?
>
> I think the answer is that it is already *like that*, although I
> can't tell what you mean by "from outside".

I think Damien's confusion comes from the fact that the abstractions are a bit leaky. For example, if you do something like this:

  'ábc' convertToEncoding: 'utf-8'

the result is 'Ã¡bc'. It's a string where the internal, "encapsulated" state is such that writing it to a socket or file will produce the desired bytes, but all in-image behavior is totally broken.

VisualWorks tends to do a better job of maintaining the abstractions, I think. The equivalent of the above example would produce a ByteArray.

> If so, one could argue that we can just hold every string in
> decomposed UTF-8 in the image, and have a couple of variants of at:
> and at:put:. The requirement of O(1) random access is not that
> important. I might go that direction if I redo it now.

A UTF8String would be really handy for web applications, where strings come in from the net as UTF-8, live in the image for a while, then get sent out as UTF-8. O(1) random access isn't very useful, because strings are (mostly) uninterpreted, but converting to Squeak's internal representation is expensive.

The thing is, as long as the "sequence of characters" abstraction is maintained, it doesn't matter (for purposes of correct behavior) what the internal representation is. So it's perfectly reasonable to have multiple encodings with different performance profiles. UTF8String when you need it, WideString when that makes sense.

Colin
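A quick way to see the leak Colin describes, as a workspace sketch (assuming the 'utf-8' converter is registered and behaves like #squeakToUtf8, answering a per-byte String):

  | s encoded |
  s := 'ábc'.                                     "three characters; á is U+00E1"
  encoded := s convertToEncoding: 'utf-8'.
  encoded size.                                    "4 - á became the two bytes 16rC3 16rA1, yet the result still answers like a String"
  (encoded convertFromEncoding: 'utf-8') = s      "true - decoding restores the original Latin-1 string"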
In reply to this post by Yoshiki Ohshima-2
On Wednesday 24 Sep 2008 2:56:43 pm Yoshiki Ohshima wrote:
> In the image, a ByteString or WideString is a sequence of characters
> that hold Unicode code points. (Note that a Unicode code point is
> 21-bit.) If all the codes in a string fit within 8 bits, we use
> ByteString; if they don't, it uses WideString

You mean a sequence of code points? Instances of Character hold only one code point (value), while some characters need more than one code point (e.g. ksha in Devanagari needs three).

Subbu
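For the curious, the ksha cluster Subbu mentions can be built from its three code points in a workspace. This is only a sketch (the code points are the standard Devanagari ones: KA U+0915, VIRAMA U+094D, SSA U+0937), and whether it renders as a single conjunct depends on the shaping engine and fonts in use:

  | ksha |
  ksha := WideString from: #(16r0915 16r094D 16r0937).   "KA + VIRAMA + SSA"
  ksha size                                               "3 - three code points, but one 'character' to a reader"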
In reply to this post by Colin Putney
At Wed, 24 Sep 2008 07:45:38 -0700,
Colin Putney wrote:
>
> A UTF8String would be really handy for web applications, where strings
> come in from the net as UTF-8, live in the image for a while, then get
> sent out as UTF-8. O(1) random access isn't very useful, because
> strings are (mostly) uninterpreted, but converting to Squeak's
> internal representation is expensive.
>
> The thing is, as long as the "sequence of characters" abstraction is
> maintained, it doesn't matter (for purposes of correct behavior) what
> the internal representation is. So it's perfectly reasonable to have
> multiple encodings with different performance profiles. UTF8String
> when you need it, WideString when that makes sense.

The thing is, though, that even on the net UTF-8 is not that dominant. There are a bunch of other encodings in use. And having both UTF8String and WideString makes comparison etc. more complicated than it should be. Having a single internal representation is cleaner. Having the encoded data in a ByteArray is a sensible thing to do. That would have been a much bigger redesign of Squeak, though.

-- Yoshiki
In reply to this post by K. K. Subramaniam
At Wed, 24 Sep 2008 20:38:18 +0530,
K. K. Subramaniam wrote:
>
> On Wednesday 24 Sep 2008 2:56:43 pm Yoshiki Ohshima wrote:
> > In the image, a ByteString or WideString is a sequence of characters
> > that hold Unicode code points. (Note that a Unicode code point is
> > 21-bit.) If all the codes in a string fit within 8 bits, we use
> > ByteString; if they don't, it uses WideString
> You mean a sequence of code points? Instances of Character hold only one
> code point (value), while some characters need more than one code point
> (e.g. ksha in Devanagari needs three).

Yes, a sequence of code points, as rephrased later in that email.

-- Yoshiki
In reply to this post by Bert Freudenberg
>> There is no such thing as a "UTF-*" character. There are Unicode
>> Characters, and Unicode Strings, and there are UTF-encoded strings
>> (UTF means Unicode Transformation Format).

Yes, I was sloppy. Thanks for the answer.

> All characters in Squeak use Unicode now.

Do you mean that the characters are all encoded using code point values? Can you tell me what the "now" refers to? OLPC? 3.8? I wanted to check the changes made in OLPC and harvest them in Pharo. Now, do you know if there are some tests somewhere?

> For example, the cyrillic Б is
>
>   char := Character value: 16r0411.
>
> this can be made into a String:
>
>   wideString := String with: char.

When I do char printString, I block Squeak 3.9. :(

> which of course has the same Unicode code points:
>
>   wideString asArray collect: [:each | each hex]
>
> gives
>
>   #('16r411')

Here you are talking about code points. How do I get the corresponding glyph? Using an encoding, I imagine.

> The string can be encoded as UTF-8:
>
>   utf8String := wideString squeakToUtf8.
>
> and to see the values there,
>
>   utf8String asArray collect: [:each | each hex]
>
> yields
>
>   #('16rD0' '16r91')
>
> which is the UTF-8 representation of the character we began with
> (but if you try to print utf8String directly you get nonsense,
> because Squeak does not know it is UTF-8 encoded).

ok

> The decoding of UTF-8 to a String is similar:
>
>   #(16rC3 16rBC) asByteArray asString utf8ToSqueak
>
> which returns the String 'ü' and probably is what you wanted in the
> first place

Why do I get a visual representation? How is the mapping done from the Unicode code point to the glyph? Should we always pass via a transformation? How does the encoding scheme (UTF-*) associate a code point with its glyph?

> - but please try to understand and use the Unicode terms correctly
> to minimize confusion.

I learned that over the last weeks, reading a lot of docs. Character sets ~= character encodings.

> Anyway, to convert between a String in UTF-8 and a regular Squeak
> String, it's simplest to use utf8ToSqueak and squeakToUtf8.

Now, utf-8 was just an example. I would like to know what a *ToSqueak is. I can understand that characters are code points in the Unicode system; now, how do I get to see their visual representation?
In reply to this post by Philippe Marschall
> Am I the only one using the generic en/decoding functionality in
> Squeak in the form of #convertTo/FromEncoding?
>
> Convert from "Squeak" to UTF-8
>   aString convertToEncoding: 'utf-8'

Do I understand correctly that such an aString is a sequence of Unicode code points?

> Convert from UTF-8 to "Squeak"
>   aString convertFromEncoding: 'utf-8'
>
> For checking out all the encodings your image supports:
>   TextConverter allEncodingNames
>
> Cheers
> Philippe
On Sat, 2008-09-27 at 08:18 +0200, stephane ducasse wrote:
> > Am I the only one using the generic en/decoding functionality in
> > Squeak in the form of #convertTo/FromEncoding?
> >
> > Convert from "Squeak" to UTF-8
> >   aString convertToEncoding: 'utf-8'
>
> Do I understand correctly that such an aString is a sequence of Unicode
> code points?

optimized encoding of a code point (utf-8). If you decode those bytes you get your code point (unicode). From a sequence of code points you can derive a character. In most cases (for us westerners) it will be a single code point, AFAIK.

Norbert
In reply to this post by stephane ducasse
On Saturday 27 Sep 2008 11:45:38 am stephane ducasse wrote:
> Why do I get a visual representation? How is the mapping done from the
> Unicode code point to the glyph?

Unicode code points are processed by a shaping engine to generate a graphic. The term 'glyph' (carving, in Greek) is historical, since typefaces were carved from metal. The shaping engine is trivial in the case of the Latin-1 character set: the first 256 code points are the same as Extended ASCII, and the graphic can be looked up in a font table. Rendering "hello" on the screen involves extracting the box dimensions and graphics of h, e, l, o from a font table, laying out five boxes, and then rendering appropriately into the five boxes. Other languages have thousands of such graphics (pictals?), and the rendering algorithms are complex enough to require a shaping engine with pluggable rendering algorithms. Google Dr. Yannis Haralambous's works for details.

> Should we always pass via a transformation?

UTF-8 is recommended when passing Unicode strings across programs and machines for the sake of backward compatibility. Within a program, the choice of encoding depends on the string-handling requirements. For instance, if a program deals with palindromes, then an encoding for "rés" like

  <r> <acute> <e> <s>

will break current algorithms that just reverse the string of code points.

> How does the encoding scheme (UTF-*) associate a code point with its glyph?

The Unicode sequence "hello world" transformed into UTF-8 is the same as its Extended ASCII encoding. The process is more involved for Asian languages, so a separate shaping engine is required. Examples are Pango, the Qt shaping engine, Uniscribe, etc.

Regards .. Subbu
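To make the palindrome point concrete, here is a small workspace sketch using the standard Unicode decomposition of 'é' (base letter U+0065 followed by U+0301 COMBINING ACUTE ACCENT); how the reversed string displays depends on the renderer:

  | decomposed |
  decomposed := WideString from: #(16r72 16r65 16r301 16r73).   "r, e, combining acute, s - reads as 'rés'"
  decomposed reversed                                            "s, combining acute, e, r - the accent now attaches to the s"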
In reply to this post by stephane ducasse
2008/9/27 stephane ducasse <[hidden email]>:
>> Am I the only one using the generic en/decoding functionality in
>> Squeak in the form of #convertTo/FromEncoding?
>>
>> Convert from "Squeak" to UTF-8
>>   aString convertToEncoding: 'utf-8'
>
> Do I understand correctly that such an aString is a sequence of Unicode
> code points?

Plus leading char. If you look at UTF8TextConverter, it will give every incoming character with an index higher than 255 the language of the image. I don't need to explain why this is problematic in the context of a web application, do I?

Cheers
Philippe
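A rough way to poke at this from a workspace. This is only a sketch: it assumes Character's #leadingChar accessor from Squeak's m17n support, and the value you actually get depends on your image's language settings (0 would mean plain Latin-1/Unicode, non-zero a language-tagged character):

  | decoded cyrillic |
  decoded := #(16rD0 16r91) asByteArray asString utf8ToSqueak.   "the UTF-8 bytes of Б from Bert's example"
  cyrillic := decoded first.
  cyrillic leadingChar                                            "the language tag the converter attached, if any"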
Philippe Marschall wrote:
> 2008/9/27 stephane ducasse <[hidden email]>:
>> Do I understand correctly that such an aString is a sequence of Unicode
>> code points?
>
> Plus leading char. If you look at UTF8TextConverter, it will give every
> incoming character with an index higher than 255 the language of the
> image. I don't need to explain why this is problematic in the context
> of a web application, do I?

Actually, it *is* worthwhile to explain this. The problem is that since UTF-8 doesn't have the notion of a leading char, there is no way to tag incoming data correctly. The leading char will be taken from the running image, so an image running in the US (like our servers) will tag input coming from Chinese browsers as Latin-1. In these situations the leading char isn't just useless, it is actively misleading.

Cheers,
  - Andreas
At Sat, 27 Sep 2008 10:14:39 -0700,
Andreas Raab wrote:
>
> Actually, it *is* worthwhile to explain this. The problem is that since
> UTF-8 doesn't have the notion of a leading char, there is no way to tag
> incoming data correctly. The leading char will be taken from the running
> image, so an image running in the US (like our servers) will tag input
> coming from Chinese browsers as Latin-1. In these situations the leading
> char isn't just useless, it is actively misleading.

For the kind of web applications and servers that deal with stuff outside of Squeak, it doesn't serve a good purpose, because editing, displaying, etc. are out of scope. Needless to say, the original idea was to make Squeak a dynamic, interactive, multilingualized environment, so there is a mismatch: web applications etc. historically came after that goal. If you need to retain this extra information, sending the strings without going through the UTF-8 conversion makes more sense.

-- Yoshiki
In reply to this post by Philippe Marschall
On Sat, Sep 27, 2008 at 7:05 PM, Philippe Marschall
<[hidden email]> wrote:
> Plus leading char.

You mean the BOM (byte order mark) or something else?

--
Damien Pollet
type less, do more [ | ] http://people.untyped.org/damien.pollet