http://stackoverflow.com/questions/16645848/squeak-monticello-character-encoding
Let's kill this one, it's totally insane.
First thing would be to simplify #setConverterForCode and #selectTextConverterForCode. Do we still want to use a MacRomanTextConverter, seriously? I'm not even sure I've got that many files with that encoding on my Mac OS X...

Do we really need to put a Byte Order Mark for UTF-8, seriously? See http://en.wikipedia.org/wiki/Byte_order_mark; it's valueless, and not recommended. It was a Squeak way to specify that a Squeak source file uses UTF-8 rather than MacRoman, but by now this should be obsolete.

2013/5/22 Nicolas Cellier <[hidden email]>
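For reference, the UTF-8 BOM under discussion is the three bytes EF BB BF. A minimal illustration in Python (used here only to show the bytes; nothing below is Squeak code):

```python
# The 'utf-8-sig' codec writes the BOM on encode and strips it on decode.
data = "Transcript".encode("utf-8-sig")   # prepends the 3-byte BOM
assert data[:3] == b"\xef\xbb\xbf"        # EF BB BF: the UTF-8 BOM

# A plain utf-8 decode keeps the BOM as a leading U+FEFF character,
# which is exactly the kind of noise the thread wants to avoid.
assert data.decode("utf-8") == "\ufeffTranscript"
assert data.decode("utf-8-sig") == "Transcript"
```

Since UTF-8 has no byte-order ambiguity, the BOM carries no information beyond "this is UTF-8", which is why the Unicode FAQ discourages it.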
In reply to this post by Nicolas Cellier
+1
In reply to this post by Nicolas Cellier
Norbert
That sounds good. We could even try to fall back to UTF-32 if we encounter zeros (but this should be very rare...).

For write, ZipArchives are unaware of any encoding... They use latin1. In Squeak, I could place some squeakToUTF8 sends in MCMczWriter, and the equivalent UTF8TextConverter in Pharo's #serializeDefinitions:; maybe this is needed in some other serialize* methods (version, dependencies, who knows...)
2013/5/22 Norbert Hartl <[hidden email]>
In reply to this post by Nicolas Cellier
On Wed, May 22, 2013 at 2:16 PM, Nicolas Cellier
<[hidden email]> wrote:
> First thing would be to simplify #setConverterForCode and
> #selectTextConverterForCode.
> Do we still want to use a MacRomanTextConverter, seriously? I'm not even
> sure I've got that many files with that encoding on my Mac-OSX...
> Do we really need to put a ByteOrderMark for UTF-8, seriously? See
> http://en.wikipedia.org/wiki/Byte_order_mark, it's valueless, and not
> recommended. It were a Squeak way to specify that a Squeak source file would
> use UTF-8 rather than MacRoman, but now this should be obsolescent.

Old code was certainly in MacRoman, and quite a few used middle dot, accented chars and other characters in the right half of the character chart. Monticello surely should use UTF-8. I'd think, though, it should keep the BOM; did you encounter any problems? (It is not recommended, but it is permitted.)

--
-- Yoshiki
MC never wrote a BOM, so we don't have to be compatible with a BOM. If we can simplify the process, let's simplify, because maintaining useless compatibility costs: the code is really crooked by now, and this leads to misunderstanding, and soon to broken features and noise. Currently, snapshot/source.st IS broken.

If there are codes > 127, the UTF8TextConverter will most likely fail, and I like Norbert's idea of retrying with a legacy encoding. This way, we put the crooked compatibility layer in the exception handling. This will also simplify the MC readers/writers in VW, gst, Gemstone, ...

Even for the legacy code, I wonder if MacRoman would be the right choice. MC never encoded the strings and always wrote the codes as is. So, setEncoderForCode is here for maintaining compatibility with MC snapshot/source.st written from an old image where the internal String encoding was MacRoman - when was it, 3.7? Are there really many of these? I bet 99% of MC files are encoded in latin-1 but decoded with MacRoman if we go through a MczInstaller...

Of course, MC now uses snapshot.bin rather than snapshot/source.st. Did old versions of MC fail to write snapshot.bin?

Eventually, we can add a Preference in Squeak for ultra-old legacy encoding (not in Pharo; I guess Pharo should not care at all).
2013/5/23 Yoshiki Ohshima <[hidden email]>
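The retry-with-a-legacy-encoding idea discussed above can be sketched as follows (in Python rather than Smalltalk, purely to illustrate the byte-level mechanics; `decode_mc_source` is a hypothetical helper, not actual Monticello code):

```python
def decode_mc_source(raw: bytes) -> str:
    """Try strict UTF-8 first; on failure fall back to a legacy
    single-byte encoding (latin-1 here, per the thread's guess that
    most old MC files are effectively latin-1)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte 0x00-0xFF to a code point,
        # so this fallback can never raise.
        return raw.decode("latin-1")

# A modern UTF-8 file and an old latin-1 file both decode correctly:
assert decode_mc_source("héllo".encode("utf-8")) == "héllo"
assert decode_mc_source(b"h\xe9llo") == "héllo"   # legacy latin-1 bytes
```

This matches the shape proposed in the thread: the common path is plain UTF-8, and the compatibility shim lives entirely in the exception handler.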
On Wed, May 22, 2013 at 3:57 PM, Nicolas Cellier
<[hidden email]> wrote:
> MC never wrote a BOM, so we don't have to be compatible with BOM.
>
> If we can simplify the process, let's simplify, because maintaining useless
> compatibility costs, the code is really crooked by now, and this leads to
> mis-understanding, and soon to broken features and noise. Currently,
> snapshot/source.st IS broken.

For a long time, yes.

> If there are codes > 127, the UTF8TextConverter will most likely fail, and I
> like the idea of Norbert to retry with a legacy encoding. This way, we put
> crooked compatibility layer in exceptional handling.
>
> This will also simplify the MC readers/writers in VW, gst, Gemstone, ...
>
> Even for the legacy code, I wonder if MacRoman would be the right choice. MC
> never encoded the strings and always wrote the codes as is.

Right. I now remember the pain.

> So, setEncoderForCode is here for maintaining compatibility with MC
> snapshot/source.st written from an old image where internal String encoding
> was MacRoman - when was it, 3.7? Are there really many of these?
>
> I bet 99% of MC-files are encoded in latin-1 but decoded with MacRoman if we
> go through a MczInstaller...
>
> Of course, MC now uses snapshot.bin rather than snapshot/source.st.
> Did old versions of MC failed to write snapshot.bin?
>
> Eventually, we can set a Preferences in Squeak for ultra old legacy encoding
> (not in Pharo, I guess Pharo should not care at all).

For Pharo, I'd guess so, too. (I heard that the Japanese support is pretty much dropped in Pharo.)

--
-- Yoshiki
In reply to this post by Nicolas Cellier
+ 1000
:) On May 22, 2013, at 11:16 PM, Nicolas Cellier <[hidden email]> wrote:
In reply to this post by Yoshiki Ohshima-3
> For Pharo, I'd guess so, too.
>
> (I heard that the Japanese support is pretty much dropped in Pharo.)

Probably because none of us has the knowledge to understand how it works and see the problems.

> --
> -- Yoshiki
In reply to this post by Nicolas Cellier
On 23.05.2013 00:06, Nicolas Cellier wrote:
A BOM does make sense in this case, explicitly to act as a legacy switch, just like how the current .st works. Notice any non-ascii files will (should) be written in utf8 after a switch, just as for regular .st.

Do we need to use a MacRomanTextConverter? No, we don't, not for MC. For regular .st MacRoman makes sense, since that was the encoding used for ByteStrings before WideStrings were invented, and as such, any legacy non-ascii .st sources written would be in that encoding. For MC .st files, the equivalent fallback legacy encoding for maximum compatibility would be Latin1.

End result:

Written by            Readable in Old   Readable in New
Old ascii             yes               yes
Old non-ascii, <256   yes               yes
Old non-ascii, >256   no                no
New ascii             yes               yes
New non-ascii         no                yes

Find attached the minor changeset I wrote a long time ago doing this. IIRC, the main problems remaining were how to handle writing source to internal streams, for which you can't specify an encoding; this is done in quite a few tests. Not sure if it's better to just disallow that case, or whether another approach should be taken.

Cheers,
Henry

MonticelloWideStrings.2.zip (1K)
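The choice of legacy fallback matters at the byte level: the same legacy byte reads differently under MacRoman and latin-1, so decoding old bytes with the wrong legacy codec silently corrupts the source. A Python sketch, for illustration only:

```python
# One legacy byte, two readings. 0xE1 is the middle dot in MacRoman
# (the character Yoshiki mentioned old code often used), but 'á' in
# latin-1 - so the two fallbacks disagree on the same file contents.
legacy = b"\xe1"

assert legacy.decode("latin-1") == "\u00e1"            # 'á' under latin-1
# Under MacRoman the same byte decodes to a different character
# (MIDDLE DOT), so the two candidate fallbacks are not interchangeable:
assert legacy.decode("mac_roman") != legacy.decode("latin-1")
```

Both codecs accept every byte without error, which is why the wrong choice fails silently rather than raising, and why the table above distinguishes the two legacy cases.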
In reply to this post by Yoshiki Ohshima-3
On 23.05.2013 03:59, Yoshiki Ohshima wrote:
> For Pharo, I'd guess so, too.
>
> (I heard that the Japanese support is pretty much dropped in Pharo.)

If you mean using leadingChar to indicate WideStrings using non-unicode encoding, and Bitmapped fonts with multiple sets for non-latin1 ranges taking it into account when selecting glyphs, then yes, it's pretty much been dropped. If you have a TrueType font with unicode -> japanese glyphs, it should still work though.

Cheers,
Henry
In reply to this post by Nicolas Cellier
On 23.05.2013 00:06, Nicolas Cellier wrote:
> That sounds good. We could even try to fallback to UT-32 if we
> encounter zeros (but his should be very rare...).
>
> For write, ZipArchive are un-aware of any encoding... They use latin1.
> In Squeak, I could place some squeakToUTF8 sends in MCMczWriter, and
> equivalent UTF8TextConverter in Pharo #serializeDefinitions:, maybe
> this is needed in some other serialize* (version, dependencies who
> knows...)

That won't work if the file contained sources for both widestring- and bytestring-sourced methods. In that case the file would contain code stored BOTH as latin1 bytes and as UTF32 (with the same endianness as the platform it was saved from). Which means you'd have to detect and handle jumps back and forth in encoding when reading... IMHO, just consider those files lost beyond hope.

Cheers,
Henry
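The "fall back to UTF-32 if we encounter zeros" heuristic mentioned earlier could look roughly like this. A hypothetical Python sketch, not actual Monticello code; `sniff_encoding` and its endianness guess are assumptions for illustration:

```python
def sniff_encoding(raw: bytes) -> str:
    """Crude heuristic from the thread: textual source should contain
    no NUL bytes, so their presence suggests UTF-32."""
    if raw.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"                 # little-endian UTF-32 BOM
    if raw.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be"                 # big-endian UTF-32 BOM
    if b"\x00" in raw:
        # No BOM: guess endianness from where the zero bytes sit
        # relative to the first (presumably ASCII) character.
        return "utf-32-le" if raw[1:4] == b"\x00\x00\x00" else "utf-32-be"
    return "utf-8"

assert sniff_encoding("abc".encode("utf-8")) == "utf-8"
assert sniff_encoding("abc".encode("utf-32-le")) == "utf-32-le"
```

As Henry notes, this only helps for files that are uniformly UTF-32; a file mixing latin1 and UTF-32 runs defeats any whole-file sniffing.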
The snapshot/source.st does not contain a mix of ByteString and WideString, because a single String is written during the process (all the code is written into a String new writeStream, which makes the String wide at the first wide Character), so it should work.

2013/5/23 Henrik Sperre Johansen <[hidden email]>
On 23.05.2013 12:52, Nicolas Cellier wrote:
You're right, I mixed it up with something else (not quite sure what anymore)... It's been a while since I last looked at it, as indicated by the .cs timestamp :)

Cheers,
Henry
You know. I'm jealous.
I'm jealous because I have to fight hard to concentrate and to get focused. So it's cool to see that some of you can. I have to handle so many things that are still important to do… anyway, keep getting focused.

Stef

On May 23, 2013, at 4:19 PM, Henrik Sperre Johansen <[hidden email]> wrote: