Just so I don’t forget: this test fails (since at least 1.3). I have determined the correct value (175) independently, both from the respective mapping tables and by using uconv.
testConversionToFrom
	self assert: (('äöü' convertToEncoding: 'mac-roman') convertFromEncoding: 'mac-roman') = 'äöü'.
	self
		assert: ((Character value: 216) asString convertToEncoding: 'mac-roman')
		equals: (Character value: 175) asString

As far as I can tell, this flaw is not exclusive to the mac-roman encoding. However, latin1 and UTF8 are probably fine.

Max
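For reference (a minimal illustration, not part of the original test): Character value 216 is U+00D8, the letter Ø, and 175 is 16rAF, the byte assigned to that letter in the MacRoman table.

	"Illustrative only: the two values the failing assertion compares."
	(Character value: 216) asString.	"=> 'Ø' (U+00D8)"
	175 printString: 16.	"=> 'AF', the MacRoman byte for Ø"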
Max,
> On 02 Mar 2016, at 19:57, Max Leske <[hidden email]> wrote:
>
> Just so I don’t forget: this test fails (since at least 1.3). [...]

That is because you are using the wrong encoders.

| encoder string |
encoder := #macroman asZnCharacterEncoder.
string := 'äöü'.
(encoder decodeBytes: (encoder encodeString: string)) = string.	"=> true"
encoder encodeString: (Character value: 216) asString.	"=> #[175]"

We should probably replace the old ones with the newer ones in Pharo 6.

Regards,

Sven
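The complementary direction (not in Sven's message, but using only the decodeBytes: call he shows above) maps the byte back to the character from Max's test:

	"Assuming the same Zn API as above: byte 175 decodes to U+00D8 (Ø)."
	#macroman asZnCharacterEncoder decodeBytes: #[175].	"=> 'Ø'"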
2016-03-02 20:18 GMT+01:00 Sven Van Caekenberghe <[hidden email]>:
> Max, [...]

+1, why the hell maintain two versions, one of them bogus?

However, since this does not seem broken in Squeak, it makes a perfect example for investigating how and why such a trivial error creeps in.
All the ZnByteEncoders are automagically initialised from the official specs; for MacRoman:

self generateByteToUnicodeSpec: 'http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT'

> On 02 Mar 2016, at 21:09, Nicolas Cellier <[hidden email]> wrote:
> [...]
> +1, why the hell maintain two versions, one of them bogus?
>
> However, since this does not seem broken in Squeak, it makes a perfect example for investigating how and why such a trivial error creeps in.
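For reference, the mapping file at that URL lists one byte-to-code-point pair per line. The entry relevant to this thread looks like this (an illustrative excerpt of the file format; see the URL itself for the authoritative data):

	0xAF	0x00D8	# LATIN CAPITAL LETTER O WITH STROKE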
2016-03-02 21:24 GMT+01:00 Sven Van Caekenberghe <[hidden email]>:
> All the ZnByteEncoders are automagically initialised from the official specs [...]

This is certainly good and robust practice, and I don't suggest removing the correct version! My question was more about the alternate library. Before removing it, may I ask how the hell it got broken? It's not broken in Squeak and was not broken in Pharo 1.4, for example. Doesn't that tell us something?
There was a bug in the old byteToUnicodeSpec tables (missing 2's in some of the later entries, one of which should have been 16r02D8).
So initializeLatin1MapAndEncodings erroneously picked that entry over the correct D8. In a modern image, Sven has fixed the byteToUnicodeSpec (the tables are parsed directly from the Unicode site, so at least the hope is that they are correct), but the cache tables have not been updated. One can do this by executing:

ByteTextConverter initialize.
ByteTextConverter allSubclassesDo: [:each | each initializeLatin1MapAndEncodings].

Cheers,
Henry
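After re-running the initializers above, the assertion from Max's original test should pass again (this check simply restates it):

	"With the rebuilt cache, U+00D8 (Ø) converts to MacRoman byte 175."
	((Character value: 216) asString convertToEncoding: 'mac-roman')
		= (Character value: 175) asString.	"=> true"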
If, by "modern", we mean 1.4 onwards...
(The ByteTextConverter tables *were* updated, it seems, so just

MacRomanTextConverter initializeLatin1MapAndEncodings.

is enough; after that,

(Character value: 216) asString convertToEncoding: 'mac-roman'

should yield correct results.)

Cheers,
Henry
Not sure I'd say Squeak's (5.0 at least) MacRoman conversion is free of bugs either; at least the "legacy" ByteTextConverter subclass in Pharo passes the following:

"U+0152, Latin capital ligature OE, is code point 16rCE in mac-roman"
((Character value: 16r0152) asString convertToEncoding: 'mac-roman') first charCode = 16rCE.

"Code point 170 in MacRoman is the TM sign, U+2122"
((Character value: 170) asString convertFromEncoding: 'mac-roman') first charCode = 16r2122.

Cheers,
Henry
2016-03-02 23:16 GMT+01:00 Henrik Sperre Johansen <[hidden email]>:
> Not sure I'd say Squeak's (5.0 at least) MacRoman conversion is free of bugs either [...]

Yes, you're right: it's because the Squeak tables did and still do use CP1252 instead of ISO 8859-1, and thus do not match Unicode. That might have made sense when porting from Mac to Windows while keeping ByteString, but at least since the switch to Unicode that's bogus. I guess it's still there because some in-image fonts support CP1252, but I am too tired to check that now...

My mistake: character Unicode 216 -> MacRoman was already wrong in Pharo 1.1. It was wrong because Pharo picked a bogus table manually crafted from internet pages (from the Sophie project?).

Then Sven did correct the table by automagically decoding the URL... But this didn't correct anything, because initializeLatin1MapAndEncodings was never invoked (it was missing already in Pharo 1.1). Unfortunately, those maps are a speed-up cache and will mask the correction of the table if they are not updated.

In Squeak, initializeLatin1MapAndEncodings was called from class-side initialization right from the beginning, but this was forgotten during the port to Pharo. It would be interesting to know why...

Ah yes, lazy initialization made it work without the need for class initialization, but that was a one-shot gun, not robust to further table changes. That's the drawback of being lazy.

So, most probably the code was too complex, and this is enough to explain the mistakes. Why was it too complex? Because it was an optimization for speed (fast scanning of bytes NOT NEEDING ANY conversion). And the initialization was too convoluted, because it was reusing the convoluted multilingual API. My feeling is that it's an effect of the "least possible change that could possibly extend functionality".

For me, it's never enough to say "the old converters were broken". There's always something to learn from a mistake, and that's why I'm asking. My feeling is that the Pharo guys always sprint and never look back. This is at the risk of repeating some mistake...
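A minimal sketch of the lazy-initialization pitfall Nicolas describes (the selector and variable names here are hypothetical, for illustration only): once the cache has been built from the then-buggy spec, later fixes to byteToUnicodeSpec change nothing until someone explicitly rebuilds it.

	encodeMap
		"Hypothetical lazy accessor: the cache is computed once from byteToUnicodeSpec
		and then frozen, so a later fix to the spec is silently masked until
		initializeLatin1MapAndEncodings (or the like) is run again."
		^ encodeMap ifNil: [ encodeMap := self computeEncodeMapFromByteToUnicodeSpec ]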
In other words Max, if you correctly invoke

MacRomanTextConverter initializeLatin1MapAndEncodings.

then the old converter is restored to good health. Now that you know that, you can remove the old converters and keep the modernized Zn ones ;)

2016-03-03 0:56 GMT+01:00 Nicolas Cellier <[hidden email]>:
> [...]
> On 03 Mar 2016, at 00:56, Nicolas Cellier <[hidden email]> wrote:
> [...]
> For me, it's never enough to say "the old converters were broken".
> There's always something to learn from a mistake, and that's why I'm asking.

Yes, it was a bit cheap to say the old ones were broken. I just tried with the new ones and saw that there was no problem there (luckily ;-); of course this little bug is fixable.

However, the old ones are bad/broken in many ways: semantics, API, behaviour, error handling, lack of functionality, hacked implementation. Also, the new ones have already been around for many years in Zn, and have been in use ever since Pharo 1.3 to do any and all HTTP encoding/decoding.

Now, it is not easy to replace one with the other at the lower levels, due to the (intentional/by-design) API differences. The high-level view is well explained in the first part of the "Character Encoding and Resource Meta Description" chapter of the Enterprise Pharo book (http://files.pharo.org/books/enterprisepharo/book/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html).

Sven
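To make the API difference concrete (both calls appear earlier in this thread; the contrast in return types is the point): the old TextConverter API maps String to String, while the Zn encoders map String to ByteArray and back.

	"Old API: the encoded bytes come back disguised as a String."
	'äöü' convertToEncoding: 'mac-roman'.	"=> a String"
	"New Zn API: encoding yields a real ByteArray."
	#macroman asZnCharacterEncoder encodeString: 'äöü'.	"=> a ByteArray"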
Thank you everyone for the explanations. How would you feel about opening an issue for Pharo 6 to remove TextConverter?
That does indeed fix the problem.
https://pharo.fogbugz.com/f/cases/17751/Remove-TextConverter
Thank you all, and I love the bug entry.
Hi nicolas

I understand your point, but you see nicolas, we are not sprinting blindly. We are the guys that pushed SUnit tests in Squeak back then, and in Pharo as well. We are pushing for quality rules and a lot more. Now we are doing a lot of things because the world is moving and we have a dream to make come true :)

And if you would commit to Pharo, that would help us too, because we have a long list of great changes:

- Epicea
- Bootstrap
- Bloc
- Xtreams

Stef

On 4/3/16 11:28, Max Leske wrote:
> https://pharo.fogbugz.com/f/cases/17751/Remove-TextConverter