squeakToUTF-8 and related?

Nicolas Cellier

Re: squeakToUTF-8 and related?

2010/3/30 Norbert Hartl <[hidden email]>:

>
> On 29.03.2010, at 11:52, Nicolas Cellier wrote:
>
> 2010/3/29 Henrik Johansen <[hidden email]>:
>
> On Mar 29, 2010, at 11:16 30AM, Nicolas Cellier wrote:
>
> I presume that under the idiom "latin1" you refer to code page 1252
>
> rather than iso8859-1, right?
>
> Nicolas
>
> Good question :)
>
> What IS the presumed internal encoding of Bytestrings in Squeak?
>
> That's the one I meant, I merely assumed it was latin1 seeing as how the
> text converter refers to it as such.
>
> Personally I thought it was iso8859-1, seeing as the bytestring to unicode
> conversion does a simple shift of chars > 127 to the 0080 - 00FF range.
>
> Cheers,
>
> Henry
>
>
> From what I understood, CP1252 is Microsoft's "latin1" and uses codes 128 to
> 159.
> ISO8859-1 matches the first 256 codes of Unicode (Latin-1) and leaves codes 128
> to 159 unused.
> You know, when Microsoft "uses" a standard, it's always a better standard ;)
>
> I have nothing against CP1252, it's an optimization which avoids
> wasting 32 cheap codes.
> But I'm not sure about various compatibility issues in/with the
> external world...
>
> If you know how to easily assure that
> (String with: (Character value: (Integer readFrom: '20AC' base: 16)))
> = (String with: (Character value: (Integer readFrom: '80' base: 16)))
> then you might be safe. With Windows-1252, code points aren't unique
> anymore: every code point in the range 0x80 - 0x9F exists somewhere else,
> too. So my estimate is that it will cause more trouble than it might
> solve.
>

Agree.
I see two different problems here:
1) absence of explicit encoding information in external data
2) existence of a canonical representation which can be easily compared...

Generalization of UTF8 should solve 1 (slowly, with a lot of inertia),
then we can simply assume implicit=UTF8.
Unicode could solve 2...
...Well, as long as diacriticals are ignored.
To me Unicode still has problems with:
(String with: 16r61 asCharacter with: 16r0302 asCharacter) = (String
with: 16rE2 asCharacter)
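
To make the comparison above concrete, here is a minimal sketch in Python (not Smalltalk, and not anything the Squeak/Pharo image does). It shows that the two spellings of "â" are different code point sequences and only compare equal after Unicode normalization. The helper name canonically_equal is made up for this illustration.

```python
import unicodedata

# 'a' followed by COMBINING CIRCUMFLEX ACCENT (two code points)
decomposed = "\u0061\u0302"
# LATIN SMALL LETTER A WITH CIRCUMFLEX (one code point)
precomposed = "\u00e2"


def canonically_equal(left: str, right: str) -> bool:
    """Compare two strings after bringing both to normalization form NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


# A plain code-point-by-code-point comparison sees two different strings.
plain_equal = decomposed == precomposed
# After normalization both spellings collapse to the same single code point.
normalized_equal = canonically_equal(decomposed, precomposed)
```

Here plain_equal is False and normalized_equal is True, which is the gap a canonical representation would have to close.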

Nicolas

> Squeak clearly uses CP1252.
> For Pharo, there might be a mix of the two since the Sophie-like
> refactorings. Surely that is what John was referring to.
>
> In Pharo the 20AC string gives me a euro sign but the 80 hex one prints a
> rectangle, which is _an_ interpretation of '?' ;)
> Norbert
>
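
The euro sign example quoted above comes down to which table is used to read the byte 0x80. A minimal sketch using Python's codec tables (an assumption for illustration; these are not the Squeak or Pharo converters) shows the two readings side by side:

```python
# The single byte under discussion.
raw = bytes([0x80])

# Windows-1252 assigns 0x80 to the euro sign, which Unicode places at U+20AC.
as_cp1252 = raw.decode("cp1252")

# ISO 8859-1 as implemented by Python maps byte N straight to code point N,
# so 0x80 becomes U+0080, a control code with no printable glyph.
as_latin1 = raw.decode("latin-1")

# The two readings of the same byte are different characters.
same_character = as_cp1252 == as_latin1
```

With the CP1252 reading the byte is U+20AC; with the ISO 8859-1 reading it is U+0080, so the two strings in the quoted comparison cannot be equal unless something maps one onto the other.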

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

NorbertHartl

Re: squeakToUTF-8 and related?

On 30.03.2010, at 11:00, Nicolas Cellier wrote:

> 2010/3/30 Norbert Hartl <[hidden email]>:
>>
>> On 29.03.2010, at 11:52, Nicolas Cellier wrote:
>>
>> 2010/3/29 Henrik Johansen <[hidden email]>:
>>
>> On Mar 29, 2010, at 11:16 30AM, Nicolas Cellier wrote:
>>
>> I presume that under the idiom "latin1" you refer to code page 1252
>>
>> rather than iso8859-1, right?
>>
>> Nicolas
>>
>> Good question :)
>>
>> What IS the presumed internal encoding of Bytestrings in Squeak?
>>
>> That's the one I meant, I merely assumed it was latin1 seeing as how the
>> text converter refers to it as such.
>>
>> Personally I thought it was iso8859-1, seeing as the bytestring to unicode
>> conversion does a simple shift of chars > 127 to the 0080 - 00FF range.
>>
>> Cheers,
>>
>> Henry
>>
>>
>> From what I understood, CP1252 is Microsoft's "latin1" and uses codes 128 to
>> 159.
>> ISO8859-1 matches the first 256 codes of Unicode (Latin-1) and leaves codes 128
>> to 159 unused.
>> You know, when Microsoft "uses" a standard, it's always a better standard ;)
>>
>> I have nothing against CP1252, it's an optimization which avoids
>> wasting 32 cheap codes.
>> But I'm not sure about various compatibility issues in/with the
>> external world...
>>
>> If you know how to easily assure that
>> (String with: (Character value: (Integer readFrom: '20AC' base: 16)))
>> = (String with: (Character value: (Integer readFrom: '80' base: 16)))
>> then you might be safe. With Windows-1252, code points aren't unique
>> anymore: every code point in the range 0x80 - 0x9F exists somewhere else,
>> too. So my estimate is that it will cause more trouble than it might
>> solve.
>>
>
> Agree.
> I see two different problems here:
> 1) absence of explicit encoding information in external data
> 2) existence of a canonical representation which can be easily compared...
>
> Generalization of UTF8 should solve 1 (slowly, with a lot of inertia),
> then we can simply assume implicit=UTF8.
> Unicode could solve 2...
> ...Well, as long as diacriticals are ignored.
> To me Unicode still has problems with:
> (String with: 16r61 asCharacter with: 16r0302 asCharacter) = (String
> with: 16rE2 asCharacter)
>

Oh well, I forgot about this. There is little chance of getting this right without changing a lot of stuff. In my opinion Character has to go the way of SmallInteger. If the world is going to be Unicode-centric, then a character needs to be a sequence of code points. A character that has only one code point will be the special case that needs to be optimized, and it will resemble what Character is now.
Given those sequences, you will still need a table that states the equality of the code point sequence and the 8-bit equivalent of e.g. â. But this is due to the western-centric specification of Unicode, and we have to live with that.

Another 2 cents,

Norbert

> Nicolas
>
>> Squeak clearly uses CP1252.
>> For Pharo, there might be a mix of the two since the Sophie-like
>> refactorings. Surely that is what John was referring to.
>>
>> In Pharo the 20AC string gives me a euro sign but the 80 hex one prints a
>> rectangle, which is _an_ interpretation of '?' ;)
>> Norbert

Philippe Marschall-2

Re: squeakToUTF-8 and related?

In reply to this post by Nicolas Cellier

Nicolas Cellier wrote:

> 2010/3/28 Stéphane Ducasse <[hidden email]>:
>>> You should ask Sophie team, their knowledge certainly is far more
>>> advanced than mine.
>> The problem is that most of them disappeared after the Java rewrite announcement.
>>
>>> String should be a SequenceableCollection of Character.
>>> Internally, for space/speed reasons they rather store a code
>>> representing the value of a Character.
>>>
>>> In a simple model, this value would be the Unicode encoding...
>>> In squeak, only lowest 22 bits of a Character value are used to encode
>>> the character (#charCode).
>>> Bits of rank 23 to 30 encode a so called #leadingChar.
>>> I guess we stopped at bit #30 just to be sure to handle SmallInteger values.
>>> Don't count on me to explain leadingChar, I can't...
>> :)
>> I read the comment of the class and some code and got lost
>>
>>> For leadingChar ~~ 0, I'm not even sure of the correct charCode interpretation...
>>>
>>> For value < 256, the interpretation of the charCode is not exactly unicode...
>>> It's more CP1252 (with assigned values to codes from 128 to 159).
>>>
>>> Once upon a time, it used to be Mac Roman encoding instead...
>>> Let's forget the past (but you could see some remnants in old code).
>>>
>>> ------------------------------
>>>
>>> When marshalling/unmarshalling strings to/from outside world we
>>> could/should use ByteArray...
>>> Unwisely, we don't.
>>> Instead, we reuse a String as storage for these codes.
>>> As a result, you see all these squeakToUtf8, utf8ToSqueak etc...
>>> That means that the contents of the String cannot be interpreted
>>> outside of its context... Very very bad IMHO.
>>> From this point of view, the String no longer has a self-contained
>>> meaning, but is just a blob of codes (on 8 or 32 bits).
>>> Fortunately, we mostly use these forms for temporary storage, but
>>> even so, I don't like it.
>>>
>>> There are other alternatives like defining subclasses of String that
>>> encapsulate their encodings and know how to be well behaved Strings,
>>> not just context dependent blobs.
>>> For example, you could as well define a UTF8String.
>>> VW went on this kind of path long time ago (not sure for utf8 though).
>>>
>>> Well, I'm not sure whether I succeeded in explaining something at all
>>> or just added confusion...
>> Don't worry.
>> For the Seaside book I started to read the Unicode standard and its history. Now it would be good to
>> know what to do and do it :)
>>
>
> Ask Seaside folks, they certainly have some ideas.
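
The Character value layout described in the quoted text (the lowest 22 bits hold #charCode, the next 8 bits hold #leadingChar) can be sketched with plain bit arithmetic. This is a Python illustration of that description only; the function and constant names are made up here, and nothing below is taken from the image's source.

```python
CHAR_CODE_BITS = 22
CHAR_CODE_MASK = (1 << CHAR_CODE_BITS) - 1  # lowest 22 bits: 0x3FFFFF
LEADING_CHAR_MASK = 0xFF                    # the 8 bits above them


def split_character_value(value: int) -> tuple:
    """Return (leading_char, char_code) for a packed character value."""
    leading_char = (value >> CHAR_CODE_BITS) & LEADING_CHAR_MASK
    char_code = value & CHAR_CODE_MASK
    return leading_char, char_code


def join_character_value(leading_char: int, char_code: int) -> int:
    """Pack a leading_char and a char_code back into one integer."""
    return (leading_char << CHAR_CODE_BITS) | char_code


# With leading_char 0 the packed value is just the character code itself.
euro = join_character_value(0, 0x20AC)
```

The packed value tops out at 30 bits, which fits the remark about staying within SmallInteger range.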

I don't like the #squeakToXXX methods because for every encoding you
support you need to add a method. That's why I prefer the
#convertToEncoding: method, one method for every use case.

I like the #xxxToSqueak methods even less; the #convertFromEncoding:
method does the same. In addition you have the fact that you're dealing
with strings not in the native Squeak format. If you pass them anywhere
you're unlikely to get the expected result. I prefer ByteArrays for this
use case, which have no semantics and make it clear that it's not a native
string.
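
As a rough illustration of both points (one parameterized method instead of one method per encoding, and raw bytes instead of a string for encoded data), here is a small Python sketch. The function names echo #convertToEncoding: and #convertFromEncoding: but are hypothetical stand-ins, not the Squeak or Pharo implementations.

```python
def convert_to_encoding(text: str, encoding_name: str) -> bytes:
    """Encode native text; the result is raw bytes, not another string."""
    return text.encode(encoding_name)


def convert_from_encoding(data: bytes, encoding_name: str) -> str:
    """Decode raw bytes in the named encoding back into native text."""
    return data.decode(encoding_name)


native = "caf\u00e9"

# The same entry point serves every encoding; no per-encoding method is needed.
utf8_bytes = convert_to_encoding(native, "utf-8")
latin1_bytes = convert_to_encoding(native, "latin-1")

# The encoded forms differ from each other and have a different type from
# the native string, so they cannot silently be treated as text.
round_trip = convert_from_encoding(utf8_bytes, "utf-8")
```

Because the encoded result has its own type, passing it where native text is expected fails visibly instead of producing wrong characters.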

Cheers
Philippe

Stéphane Ducasse

Re: squeakToUTF-8 and related?

In reply to this post by Henrik Sperre Johansen

On Mar 29, 2010, at 11:10 AM, Henrik Johansen wrote:

>
> On Mar 28, 2010, at 4:36 13PM, Stéphane Ducasse wrote:
>
>> Hi
>>
>> I'm trying to remember the situation with the internal representation of string in pharo/squeak
>> to revise http://book.seaside.st/book/in-action/serving-files/character-encodings/seaside-pharo
>>
>> I saw that in Pharo we have this NonASCIIMap. I do not remember what has been done in Pharo.
>> Argh memory leaks.... Nicolas do you remember the situation?
> NonASCIIMap is used for quickly determining whether a string has no character codes > 127 (i.e. only ASCII characters).
> It's very useful for doing a primitive-accelerated isAsciiString, which in the case of ASCII-compatible encodings (utf8, latin1, macroman, etc.) would mean no conversion is required for it to be in the "appropriate" internal bytestring format.
> It's used e.g. in the nextChunk code,

ok thanks

> Strangely it is also used in FileStream writeSourceCodeFrom: baseName: isSt: , where for some reason we use MacRoman if the stream contents isAscii, which really makes no sense, but whatever.

OK, maybe Levente fixed that in Squeak.
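
The reason an ASCII-only check lets you skip conversion can be shown with a short Python sketch (Python codecs, not the NonASCIIMap primitive; is_ascii_bytes is a name invented for this example): bytes below 128 decode to the same text under utf8, latin1 and macroman, and only higher bytes depend on the encoding.

```python
def is_ascii_bytes(data: bytes) -> bool:
    """True when no byte is above 127, so no conversion is needed."""
    return all(byte < 128 for byte in data)


source_chunk = b"Object subclass: #Foo"

# ASCII-only input decodes to identical text under every ASCII-compatible codec.
decoded = {name: source_chunk.decode(name)
           for name in ("utf-8", "latin-1", "mac_roman")}
all_agree = len(set(decoded.values())) == 1

# A byte above 127 is where the encodings start to disagree.
high_byte = bytes([0xE9])
latin1_reading = high_byte.decode("latin-1")
macroman_reading = high_byte.decode("mac_roman")
```

This also shows why picking MacRoman for content that is already known to be ASCII changes nothing in the output.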

>
> John pointed out some converters were lying; I'm not entirely sure that's true anymore. What IS certain, though, is that the external code format used is inconsistent, depending on where/how you save/load it.

Maybe we should write some tests to know what to fix.

> It really should be cleaned up to always store in utf8, and possibly also latin1 if possible.
> All this should be cleared up to always try reading as UTF8, then raising an InvalidUTF8 error which can be handled by telling it to use a different converter and restart.

ok
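
The "try UTF8 first, and on failure restart with another converter" idea quoted above could look roughly like this in Python. It is a sketch under assumptions: read_text and its fallback_encoding parameter are hypothetical, and Python's UnicodeDecodeError stands in for the proposed InvalidUTF8 error.

```python
def read_text(data: bytes, fallback_encoding: str = "latin-1") -> str:
    """Decode as strict UTF-8; if that fails, redo the decode with a fallback."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # The input was not valid UTF-8: restart with the converter the
        # caller chose (from a menu, a default, or prior knowledge).
        return data.decode(fallback_encoding)


from_utf8 = read_text(b"caf\xc3\xa9")          # valid UTF-8, decoded directly
from_latin1 = read_text(b"caf\xe9")            # invalid UTF-8, falls back
from_cp1252 = read_text(b"\x80", "cp1252")     # caller names another converter
```

The fallback still has to be chosen by someone who knows where the file came from; strict UTF-8 decoding only tells you when the first guess was wrong.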

> Possibly chosen from a menu when dropping a file on image, or choosing an alternative automatically if we know the possible other encodings a file could have been saved as, not sure how to best do it for scripts given as parameters when launching the vm
>
> On the font rendering side, I agree with Nicolas it's too complicated doing font rendering in-image, FT is an ok compromise though.
> As for the bitmap strikefont rendering, what is really needed is a way to specify the charset it represents, and mappings from the internal string encodings to its glyphs.
> F.ex., Bitmap DejaVu is really latin15, so it will currently render some ByteString characters incorrectly, as well as render some Unicode chars it really has glyphs for as ?. (such as the euro sign)
>
> Which all really has nothing to do with your initial question :)

no problem I like to learn.

> The internal representation of strings really hasn't changed since it was written, with the exception that leadingChar for WideStrings are now zero.
> As far as I can tell, that means the internal storage format of WideStrings is now equivalent to utf32, not sure what byte order it uses though, or if that is even consistent across platforms. :)
>
> The point about using WaKomEncoded, and passing all strings going into/out of the image through an encoder is still valid.
>
> Cheers,
> Henry

Stéphane Ducasse

Re: squeakToUTF-8 and related?

In reply to this post by Michael Rueger-6

>
> And I think so does the rest. In Pharo, not necessarily Squeak...
>
> Part of the confusing discussion on the Pharo(!) list might be that you were actually talking about the current state of Squeak.
>
> There was a major UTF8 cleanup for Pharo and messages like macToSqueak shouldn't even exist in Pharo anymore.
>
> Michael

OK this is what I wanted to know.
So mike could you summarize the Pharo situation.

Now I would love to have some tests.

Stef

Michael Rueger-6

Re: squeakToUTF-8 and related?

On 4/1/2010 10:38 AM, Stéphane Ducasse wrote:

> So mike could you summarize the Pharo situation.

Until someone proves me wrong I would say that Pharo is UTF-8 clean :-)

Michael

Henrik Sperre Johansen

Re: squeakToUTF-8 and related?

http://code.google.com/p/pharo/issues/detail?id=1608

Den 1. apr. 2010 kl. 11.31 skrev Michael Rueger <[hidden email]>:

> On 4/1/2010 10:38 AM, Stéphane Ducasse wrote:
>
>> So mike could you summarize the Pharo situation.
>
> Until someone proves me wrong I would say that Pharo is UTF-8
> clean :-)
>
>
> Michael

Stéphane Ducasse

Re: squeakToUTF-8 and related?

In reply to this post by Michael Rueger-6

On Apr 1, 2010, at 11:31 AM, Michael Rueger wrote:

> On 4/1/2010 10:38 AM, Stéphane Ducasse wrote:
>
>> So mike could you summarize the Pharo situation.
>
> Until someone proves me wrong I would say that Pharo is UTF-8 clean :-)

Sure but what does it mean for stef the stupid:
characters/strings are encoded in UTF-8 or optimized -> 127 and then after
If you could just write a little paragraph that I understand once :)

>
>
> Michael

Michael Rueger-6

Re: squeakToUTF-8 and related?

On 4/1/2010 3:26 PM, Stéphane Ducasse wrote:

>> Until someone proves me wrong I would say that Pharo is UTF-8 clean :-)
>
> Sure but what does it mean for stef the stupid:
> characters/strings are encoded in UTF-8 or optimized -> 127 and then after
> If you could just write a little paragraph that I understand once :)

I'll try :-)

And I should have written unicode clean, not UTF-8.

So modulo bugs like the one Henrik pointed out Pharo
- keeps all strings in the image in unicode. Either as byte strings for
strings that do not contain any characters larger than 127, WideString
otherwise using basically UTF-32 encoding.
- has all en/decoders fixed to do the correct *-encoding to unicode and
back translation
- utilizes the unicode character entry in the input events, so it should
be possible to input all unicode characters on the different keyboards
(us, german, french, russian, etc)
- uses unicode encoding for filenames
- uses unicode encoding for the clipboard

Hope I didn't leave out anything important :-)

You still need to pick the correct en/decoder to interpret file contents
correctly, the system just can't know which encoding the file is in (see
e.g. text edit on the mac, you need to set the proper encoding there as
well).

Michael

Henrik Sperre Johansen

Re: squeakToUTF-8 and related?

On Apr 1, 2010, at 3:47 02PM, Michael Rueger wrote:

> On 4/1/2010 3:26 PM, Stéphane Ducasse wrote:
>
>>> Until someone proves me wrong I would say that Pharo is UTF-8 clean :-)
>>
>> Sure but what does it mean for stef the stupid:
>> characters/strings are encoded in UTF-8 or optimized -> 127 and then after
>> If you could just write a little paragraph that I understand once :)
>
> I'll try :-)
>
> And I should have written unicode clean, not UTF-8.
>
> So modulo bugs like the one Henrik pointed out Pharo
> - keeps all strings in the image in unicode. Either as byte strings for strings that do not contain any characters larger than 127, WideString otherwise using basically UTF-32 encoding.
> - has all en/decoders fixed to do the correct *-encoding to unicode and back translation
> - utilizes the unicode character entry in the input events, so it should be possible to input all unicode characters on the different keyboards (us, german, french, russian, etc)
> - uses unicode encoding for filenames
> - uses unicode encoding for the clipboard
>
> Hope I didn't leave out anything important :-)

I'd like to add
"Other means of importing/exporting code", i.e. the reading/writing logic for .mcz, .st and .cs files.
This is where, to me, there still seem to be shady areas in Pharo. Mostly because they seem to assume different encodings for non-utf8 readable input.

Also, I don't really mind if we keep some strings as latin1 when possible, as they tend to be a lot faster to process, and the conversion between them and WideStrings is trivial and, as far as I can tell, reliable.

Cheers,
Henry

PS. The other issues with rendering absolutely apply to Pharo, but are not quite related to string encoding per se.

Stéphane Ducasse

Re: squeakToUTF-8 and related?

In reply to this post by Michael Rueger-6

>>
>
> I'll try :-)

Thanks I appreciate your effort to educate me. :)

> And I should have written unicode clean, not UTF-8.
>
> So modulo bugs like the one Henrik pointed out Pharo
> - keeps all strings in the image in unicode. Either as byte strings for strings that do not contain any characters larger than 127, WideString otherwise using basically UTF-32 encoding.
> - has all en/decoders fixed to do the correct *-encoding to unicode and back translation
> - utilizes the unicode character entry in the input events, so it should be possible to input all unicode characters on the different keyboards (us, german, french, russian, etc)
> - uses unicode encoding for filenames
> - uses unicode encoding for the clipboard
>
> Hope I didn't leave out anything important :-)
>
> You still need to pick the correct en/decoder to interpret file contents correctly, the system just can't know which encoding the file is in (see e.g. text edit on the mac, you need to set the proper encoding there as well).
>
> Michael

Levente Uzonyi-2

Re: squeakToUTF-8 and related?

In reply to this post by Michael Rueger-6

On Thu, 1 Apr 2010, Michael Rueger wrote:

This sounds really inefficient. Did you remove the primitive send from
ByteString >> #at:put:, or does the following work, breaking the above
constraint?

(ByteString basicNew: 1) at: 1 put: (Character value: 128)

Levente

> otherwise using basically UTF-32 encoding.
> - has all en/decoders fixed to do the correct *-encoding to unicode and back
> translation
> - utilizes the unicode character entry in the input events, so it should be
> possible to input all unicode characters on the different keyboards (us,
> german, french, russian, etc)
> - uses unicode encoding for filenames
> - uses unicode encoding for the clipboard
>
> Hope I didn't leave out anything important :-)
>
> You still need to pick the correct en/decoder to interpret file contents
> correctly, the system just can't know which encoding the file is in (see e.g.
> text edit on the mac, you need to set the proper encoding there as well).
>
> Michael

Michael Rueger-6

Re: squeakToUTF-8 and related?

On 4/3/2010 1:09 PM, Levente Uzonyi wrote:

> This sounds really inefficient. Did you remove the primitive send from
> ByteString >> #at:put:, or the following works, breaking the above
> constraint?
>
> (ByteString basicNew: 1) at: 1 put: (Character value: 128)

Apologies, it's actually 256 of course. Confused it with ASCII/Latin1
compatibility.

(ByteString basicNew: 1) at: 1 put: (Character value: 256); yourself

will return a WideString.
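
Put differently, the narrow/wide split sits at code point 256, not 128. A small Python sketch of that rule follows. It is an illustration only: choose_storage is a made-up name, latin-1 stands in for "one byte per code point", and big-endian UTF-32 is an arbitrary pick since the thread leaves the byte order open.

```python
from typing import Tuple


def choose_storage(text: str) -> Tuple[str, bytes]:
    """Pick a 1-byte or 4-byte-per-character layout for the given text."""
    if all(ord(ch) < 256 for ch in text):
        # Every code point fits in one byte; latin-1 maps code point N to byte N.
        return "byte", text.encode("latin-1")
    # At least one code point is 256 or above: use 4 bytes per code point.
    # Big-endian is an arbitrary choice made for this illustration.
    return "wide", text.encode("utf-32-be")


narrow = choose_storage("caf\u00e9")   # all code points below 256
edge = choose_storage("\u00ff")        # 255 still fits in one byte
wide = choose_storage("\u0100")        # 256 forces the wide layout
```

Only the last call ends up with the wide layout, matching the at:put: example with Character value: 256.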

Michael
