MC should really write snaphsot/source.st in UTF8

Nicolas Cellier

MC should really write snaphsot/source.st in UTF8

http://stackoverflow.com/questions/16645848/squeak-monticello-character-encoding
Let's kill this one, it's totally insane

Nicolas Cellier

Re: MC should really write snaphsot/source.st in UTF8

First thing would be to simplify #setConverterForCode and #selectTextConverterForCode.
Do we still want to use a MacRomanTextConverter, seriously? I'm not even sure I've got that many files with that encoding on my Mac-OSX...

Do we really need to put a ByteOrderMark for UTF-8, seriously? See http://en.wikipedia.org/wiki/Byte_order_mark, it's valueless, and not recommended. It was a Squeak way to specify that a Squeak source file would use UTF-8 rather than MacRoman, but by now this should be obsolescent.
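
To make the BOM point concrete, here is a minimal sketch in Python (not Squeak code; the sample source string is made up for illustration). It only shows what the three BOM bytes are and how a decoder may either keep or strip them.

```python
import codecs

# The UTF-8 "BOM" is just the three bytes EF BB BF at the start of the data.
source = "Object subclass: #Foo"  # hypothetical sample content
with_bom = codecs.BOM_UTF8 + source.encode("utf-8")
without_bom = source.encode("utf-8")

# A plain UTF-8 decoder keeps the BOM as a leading U+FEFF character...
plain = with_bom.decode("utf-8")
# ...while "utf-8-sig" strips it if present and tolerates its absence.
stripped = with_bom.decode("utf-8-sig")
no_bom = without_bom.decode("utf-8-sig")

print(with_bom[:3].hex(), repr(plain[0]), stripped == no_bom)
```

A reader that tolerates an optional BOM costs little, whereas a writer that emits one forces every other tool to deal with it.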


Tobias Pape

Re: MC should really write snaphsot/source.st in UTF8

Fun fact:

Having one “wide” character somewhere in one method or comment and filing out an mcz, the source.st becomes a persisted WideString (utf16?).

And you won't know it...


Nicolas Cellier

Re: MC should really write snaphsot/source.st in UTF8

Yes, it's UTF-32BE, see SO post. And you get a bonus point if you can find by what magic this happens without tracing in a Debugger ;)


Nicolas Cellier

Fwd: [Pharo-dev] MC should really write snaphsot/source.st in UTF8

In reply to this post by Nicolas Cellier

---------- Forwarded message ----------
From: Nicolas Cellier <[hidden email]>
Date: 2013/5/23
Subject: Re: [Pharo-dev] MC should really write snaphsot/source.st in UTF8
To: Discusses Development of Pharo <[hidden email]>

That sounds good. We could even try to fall back to UTF-32 if we encounter zeros (but this should be very rare...).

For writing, ZipArchive is unaware of any encoding... It uses latin1.

In Squeak, I could place some squeakToUTF8 sends in MCMczWriter, and the equivalent UTF8TextConverter in Pharo's #serializeDefinitions:. Maybe this is needed in some other serialize* methods too (version, dependencies, who knows...).
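
The write side of this idea can be sketched in Python, assuming nothing about the actual MCMczWriter code: the archive layer only ever sees bytes, so the text is encoded explicitly before it is handed over. The helper name `write_source_member` and the sample text are hypothetical.

```python
import io
import zipfile


def write_source_member(archive, text, name="snapshot/source.st"):
    """Encode text as UTF-8 explicitly before handing bytes to the archive."""
    archive.writestr(name, text.encode("utf-8"))


sample = "comment: 'caf\u00e9 \u4e2d'"  # one Latin-1 char, one wide char

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    write_source_member(archive, sample)

# Read the member back as raw bytes to see what was actually stored.
with zipfile.ZipFile(buffer, "r") as archive:
    raw = archive.read("snapshot/source.st")

print(raw)
```

Because the encoding step is explicit, a single wide character no longer changes the on-disk format of the whole member.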

2013/5/22 Norbert Hartl <[hidden email]>

Am 22.05.2013 um 23:16 schrieb Nicolas Cellier <[hidden email]>:

First thing would be to simplify #setConverterForCode and #selectTextConverterForCode.
Do we still want to use a MacRomanTextConverter, seriously? I'm not even sure I've got that many files with that encoding on my Mac-OSX...

Do we really need to put a ByteOrderMark for UTF-8, seriously? See http://en.wikipedia.org/wiki/Byte_order_mark, it's valueless, and not recommended. It was a Squeak way to specify that a Squeak source file would use UTF-8 rather than MacRoman, but by now this should be obsolescent.

A BOM for UTF-8 does not make sense. It could act as a switch between a legacy encoding and UTF-8, but that would also be a decision we would regret shortly after. Most files in Monticello are 7-bit, so changing the default encoding wouldn't be a problem for them. For every other file, the UTF-8 decoder will throw an exception. So reading as UTF-8 and, on exception, reading the same thing again in the legacy encoding might be a way to go.

Norbert
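
A minimal Python sketch of the read strategy discussed in this message, under stated assumptions: the function name `decode_source` is hypothetical, latin-1 stands in for "the legacy encoding", and the zero-byte check models the UTF-32 fallback mentioned above. It is not the Monticello reader.

```python
def decode_source(data, legacy="latin-1"):
    """Decode source bytes: UTF-32BE if zeros are present, else UTF-8,
    else a legacy single-byte encoding as a last resort."""
    # Zero bytes should never occur in 7-bit or UTF-8 source text, so
    # treat them as a sign of an accidentally persisted wide string.
    if b"\x00" in data:
        return data.decode("utf-32-be")
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # The compatibility path lives only in the exception handler.
        return data.decode(legacy)


print(decode_source(b"plain ascii"))
print(decode_source("caf\u00e9".encode("utf-8")))
print(decode_source(b"caf\xe9 "))  # latin-1 bytes, invalid as UTF-8
```

One caveat: some legacy byte sequences happen to be valid UTF-8, so the fallback catches most but not all legacy files.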


Yoshiki Ohshima-3

Re: MC should really write snaphsot/source.st in UTF8

In reply to this post by Nicolas Cellier

On Wed, May 22, 2013 at 2:16 PM, Nicolas Cellier
<[hidden email]> wrote:
> First thing would be to simplify #setConverterForCode and
> #selectTextConverterForCode.
> Do we still want to use a MacRomanTextConverter, seriously? I'm not even
> sure I've got that many files with that encoding on my Mac-OSX...
> Do we really need to put a ByteOrderMark for UTF-8, seriously? See
> http://en.wikipedia.org/wiki/Byte_order_mark, it's valueless, and not
> recommended. It was a Squeak way to specify that a Squeak source file would
> use UTF-8 rather than MacRoman, but now this should be obsolescent.

Old code was certainly in MacRoman, and quite a few used middle dot,
accented chars and other characters in the right half of the character
chart.

Monticello surely should use UTF-8. I'd think, though, it should keep
BOM; did you encounter any problems? (it is not recommended, but it
is permitted.)

--
-- Yoshiki

Tobias Pape

Re: MC should really write snaphsot/source.st in UTF8

In reply to this post by Nicolas Cellier

Am 23.05.2013 um 00:11 schrieb Nicolas Cellier <[hidden email]>:

Yes, it's UTF-32BE, see SO post. And you get a bonus point if you can find by what magic this happens without tracing in a Debugger ;)

That one might be easy, as it bit me several times.

First, the stream that is used for source.st is backed by a simple ByteString, but during the writing of the definitions, comments, etc., once you hit a method with a wide Character, its source is a WideString, and putting that onto the stream makes the underlying string #become a WideString, too. It will then be written using UTF-32BE (as I just learned).
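
The effect described here can be modelled in a few lines of Python. This is only a model, assuming a wide string holds one 32-bit word per character and that the words are dumped in big-endian order as reported in this thread. It does not reproduce the actual Squeak code path, and `dump_code_points_as_words` is a hypothetical helper.

```python
import struct


def dump_code_points_as_words(text):
    """Write each character's code point as a 32-bit big-endian word."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


narrow = "foo"
wide = "foo \u4e2d"  # a single wide character widens the whole string

narrow_bytes = narrow.encode("latin-1")        # one byte per character
wide_bytes = dump_code_points_as_words(wide)   # four bytes per character

print(len(narrow_bytes), len(wide_bytes))
print(wide_bytes == wide.encode("utf-32-be"))
```

Dumping raw 32-bit code points in big-endian order is byte-for-byte what UTF-32BE looks like, which is why the file ends up in that encoding without any converter being involved.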


Nicolas Cellier

Re: MC should really write snaphsot/source.st in UTF8

In reply to this post by Yoshiki Ohshima-3

MC never wrote a BOM, so we don't have to be compatible with BOM.

If we can simplify the process, let's simplify, because maintaining useless compatibility has a cost: the code is really crooked by now, and this leads to misunderstanding, and soon to broken features and noise. Currently, snapshot/source.st IS broken.

If there are codes > 127, the UTF8TextConverter will most likely fail, and I like Norbert's idea of retrying with a legacy encoding. This way, we confine the crooked compatibility layer to exception handling.

This will also simplify the MC readers/writers in VW, gst, Gemstone, ...

Even for the legacy code, I wonder if MacRoman would be the right choice. MC never encoded the strings and always wrote the codes as is.

So, #setConverterForCode is here to maintain compatibility with MC snapshot/source.st files written from an old image where the internal String encoding was MacRoman - when was it, 3.7? Are there really many of these?

I bet 99% of MC-files are encoded in latin-1 but decoded with MacRoman if we go through a MczInstaller...
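
What such a mismatch does to the text can be illustrated with a small Python sketch (the sample string is made up; `mac_roman` and `latin-1` are the standard Python codec names): bytes written as latin-1 and read back as MacRoman keep the ASCII part intact but silently change every character above 127.

```python
original = "caf\u00e9 \u00b7"  # accented char and middle dot, both > 127

stored = original.encode("latin-1")   # what was actually written
misread = stored.decode("mac_roman")  # what a MacRoman reader sees
reread = stored.decode("latin-1")     # what a latin-1 reader sees

print(repr(original), repr(misread), repr(reread))
```

No exception is raised in either case, so this kind of corruption goes unnoticed unless someone looks at the non-ASCII characters.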

Of course, MC now uses snapshot.bin rather than snapshot/source.st.

Did old versions of MC fail to write snapshot.bin?

If need be, we can add a preference in Squeak for ultra-old legacy encoding (not in Pharo, I guess Pharo should not care at all).


Nicolas Cellier

Re: MC should really write snaphsot/source.st in UTF8

In reply to this post by Tobias Pape

Yes, this is the easy part. I was speaking of how the hell we can write a 32-bit word-oriented collection into a byte-oriented stream and magically end up with UTF-32BE ;)


Yoshiki Ohshima-3

Re: MC should really write snaphsot/source.st in UTF8

In reply to this post by Nicolas Cellier

On Wed, May 22, 2013 at 3:57 PM, Nicolas Cellier
<[hidden email]> wrote:
> MC never wrote a BOM, so we don't have to be compatible with BOM.
>
> If we can simplify the process, let's simplify, because maintaining useless
> compatibility costs, the code is really crooked by now, and this leads to
> misunderstanding, and soon to broken features and noise. Currently,
> snapshot/source.st IS broken.

For a long time, yes.

> If there are codes > 127, the UTF8TextConverter will most likely fail, and I
> like the idea of Norbert to retry with a legacy encoding. This way, we put
> crooked compatibility layer in exceptional handling.
>
> This will also simplify the MC readers/writers in VW, gst, Gemstone, ...
>
> Even for the legacy code, I wonder if MacRoman would be the right choice. MC
> never encoded the strings and always wrote the codes as is.

Right. I now remember the pain.

> So, #setConverterForCode is here for maintaining compatibility with MC
> snapshot/source.st written from an old image where internal String encoding
> was MacRoman - when was it, 3.7? Are there really many of these?
>
> I bet 99% of MC-files are encoded in latin-1 but decoded with MacRoman if we
> go through a MczInstaller...
>
> Of course, MC now uses snapshot.bin rather than snapshot/source.st.
> Did old versions of MC fail to write snapshot.bin?
>
> Eventually, we can set a Preferences in Squeak for ultra old legacy encoding
> (not in Pharo, I guess Pharo should not care at all).

For Pharo, I'd guess so, too.

(I heard that the Japanese support is pretty much dropped in Pharo.)

--
-- Yoshiki

Stéphane Ducasse

Re: [Pharo-dev] MC should really write snaphsot/source.st in UTF8

In reply to this post by Nicolas Cellier

+ 1000

On May 22, 2013, at 11:16 PM, Nicolas Cellier <[hidden email]> wrote:

First thing would be to simplify #setConverterForCode and #selectTextConverterForCode.
Do we still want to use a MacRomanTextConverter, seriously? I'm not even sure I've got that many files with that encoding on my Mac-OSX…

;)

Do we really need to put a ByteOrderMark for UTF-8, seriously? See http://en.wikipedia.org/wiki/Byte_order_mark, it's valueless, and not recommended. It was a Squeak way to specify that a Squeak source file would use UTF-8 rather than MacRoman, but by now this should be obsolescent.


stephane ducasse

Re: [Pharo-dev] [squeak-dev] Re: MC should really write snaphsot/source.st in UTF8

In reply to this post by Yoshiki Ohshima-3

> For Pharo, I'd guess so, too.
>
> (I heard that the Japanese support is pretty much dropped in Pharo.)

Probably because none of us has the knowledge to understand how it works and see the problems.


Tobias Pape

Re: MC should really write snaphsot/source.st in UTF8

In reply to this post by Nicolas Cellier

At least it does not switch encodings mid-stream, this would be the absolute nightmare...


Nicolas Cellier

Re: [Pharo-dev] MC should really write snaphsot/source.st in UTF8

In reply to this post by Nicolas Cellier

The snapshot/source.st does not contain a mix of ByteString and WideString, because a single String is written during the process (all code is written into a String new writeStream, which makes the String wide at the first wide Character), so it should work.

2013/5/23 Henrik Sperre Johansen <[hidden email]>

On 23.05.2013 00:06, Nicolas Cellier wrote:

That sounds good. We could even try to fall back to UTF-32 if we encounter zeros (but this should be very rare...).

For writing, ZipArchive is unaware of any encoding... It uses latin1.
In Squeak, I could place some squeakToUTF8 sends in MCMczWriter, and the equivalent UTF8TextConverter in Pharo's #serializeDefinitions:. Maybe this is needed in some other serialize* methods too (version, dependencies, who knows...).

That won't work if the file contains sources for both WideString- and ByteString-sourced methods.
In that case the file would contain code stored BOTH as latin1 bytes and as UTF-32 (with the endianness of the platform it was saved from).
Which means you'd have to detect and handle jumps back and forth in encoding when reading...
IMHO, just consider those files lost beyond hope.

Cheers,
Henry

Bert Freudenberg

Re: MC should really write snaphsot/source.st in UTF8

In reply to this post by Tobias Pape

On 2013-05-23, at 08:47, Tobias Pape <[hidden email]> wrote:

At least it does not switch encodings mid-stream, this would be the absolute nightmare...

It did, at one point, but we fixed that. Can't remember the details though.

- Bert -
