TextConverter is broken

Max Leske
Just so I don’t forget: this test fails (since at least Pharo 1.3). I have determined the correct value (175) independently, both from the respective encoding tables and with uconv.

testConversionToFrom
        self assert: (('äöü' convertToEncoding: 'mac-roman') convertFromEncoding: 'mac-roman') = 'äöü'.
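        "216 is U+00D8 (Ø); its MacRoman code point is 175 (16rAF)"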
        self
                assert: ((Character value: 216) asString convertToEncoding: 'mac-roman')
                equals: (Character value: 175) asString

As far as I can tell, this flaw is not exclusive to the mac-roman encoding. However, latin1 and UTF8 are probably fine.
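
A quick round-trip check along the same lines for those two (the encoding name strings 'latin1' and 'utf8' are an assumption about what the old converters accept):

        #('latin1' 'utf8') collect: [ :name |
                (('äöü' convertToEncoding: name) convertFromEncoding: name) = 'äöü' ].
        "expected => #(true true)"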

Max

Re: TextConverter is broken

Sven Van Caekenberghe-2
Max,

That is because you are using the wrong encoders.

| encoder string |
encoder := #macroman asZnCharacterEncoder.
string := 'äöü'.
(encoder decodeBytes: (encoder encodeString: string)) = string. " => true"
encoder encodeString: (Character value: 216) asString. " => #[175]"
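"decoding those bytes should give the Unicode character back"
encoder decodeBytes: #[175]. " => (Character value: 216) asString, i.e. 'Ø'"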

We should probably replace the old ones with the newer ones in Pharo 6.

Regards,

Sven

Re: TextConverter is broken

Nicolas Cellier
+1, why the hell are we maintaining two versions, one of them bogus?

However, since this does not seem broken in Squeak, it makes a perfect example for investigating how and why such a trivial error creeps in.



Re: TextConverter is broken

Sven Van Caekenberghe-2
All the ZnByteEncoders are automagically initialised from the official specs; for MacRoman:

self generateByteToUnicodeSpec: 'http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT'
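
That mapping file is a plain tab-separated table of byte value, Unicode code point and character name; the entry relevant here ties 16rAF to U+00D8 (quoted from memory, so double-check against the file itself):

0xAF	0x00D8	# LATIN CAPITAL LETTER O WITH STROKE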

Re: TextConverter is broken

Nicolas Cellier
This is certainly good and robust practice, and I don't suggest removing the correct version!
My question was more about the alternate library.
Before removing it, may I ask: how the hell did it get broken?
It's not broken in Squeak and was not broken in Pharo 1.4, for example.
Doesn't that tell us something?

Re: TextConverter is broken

Henrik Sperre Johansen
In reply to this post by Nicolas Cellier
There was a bug in the old byteToUnicodeSpec tables (some of the later entries were missing a 2; one of them should have been 16r02D8).

So initializeLatin1MapAndEncodings erroneously picked that entry over the correct 16rD8.

In a modern image, Sven has fixed the byteToUnicodeSpec tables (they're parsed directly from the Unicode site, so the hope is that they are correct), but the cached tables have not been updated. One can do that by executing:

ByteTextConverter initialize.
ByteTextConverter allSubclassesDo: [:each | each initializeLatin1MapAndEncodings].
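
Afterwards, the failing assertion from the opening message should pass again (assuming nothing else still caches the old values):

((Character value: 216) asString convertToEncoding: 'mac-roman') = (Character value: 175) asString. " => true"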

Cheers,
Henry

Re: TextConverter is broken

Henrik Sperre Johansen
If, by "modern", we mean Pharo 1.4 onwards...
(The ByteTextConverter tables *were* updated, it seems, so just

MacRomanTextConverter initializeLatin1MapAndEncodings.
(Character value: 216) asString convertToEncoding: 'mac-roman'

should yield correct results)

Cheers,
Henry

Re: TextConverter is broken

Henrik Sperre Johansen
Not sure I'd say Squeak's (5.0 at least) MacRoman conversion is free of bugs either; at least the "legacy" ByteTextConverter subclass in Pharo passes the following:

"U+0152, Latin capital ligature OE is codepoint 16rCE in mac-roman"
((Character value: 16r0152) asString convertToEncoding: 'mac-roman') first charCode = 16rCE.  

"Codepoint 170 in MacRoman is TM sign, U+2122"
((Character value: 170) asString convertFromEncoding: 'mac-roman') first charCode = 16r2122.

Cheers,
Henry

Re: TextConverter is broken

Nicolas Cellier
Yes, you're right: it's because the Squeak tables did, and still do, use CP1252 instead of ISO 8859-1 and thus do not match Unicode (the two differ in the 16r80-16r9F range; 16r80 is the Euro sign in CP1252 but a control character in ISO 8859-1, for instance). That might have made sense when porting from Mac to Windows while keeping ByteString, but at least since the switch to Unicode it's bogus. I guess it's still there because some in-image fonts support CP1252, but I am too tired to check that now...

My mistake: the conversion of Unicode character 216 to MacRoman was already wrong in Pharo 1.1.
It was wrong because Pharo picked up a bogus table manually crafted from internet pages (from the Sophie project?).

Then Sven did correct the table by automagically decoding the URL...
But this didn't fix anything, because initializeLatin1MapAndEncodings was never invoked (it was already missing in Pharo 1.1).
Unfortunately, those maps are a speed-up cache and will mask the correction of the table if they are not updated.

In Squeak, initializeLatin1MapAndEncodings was called from class-side initialization right from the beginning, but this was forgotten during the port to Pharo; it would be interesting to know why...

Ah yes, lazy initialization made it work without the need for class initialization, but that was a one-shot gun, not robust to further table changes; that's the drawback of being lazy.

So, most probably the code was too complex, and that is enough to explain the mistakes.
Why was it too complex?
Because it was an optimization for speed (fast scanning of bytes NOT NEEDING ANY conversion).
And the initialization was overly convoluted because it was reusing the convoluted multilingual API.
My feeling is that it's an effect of the "least possible change that could possibly extend functionality".

For me, it's never enough to say "the old converters were broken".
There's always something to learn from a mistake, and that's why I'm asking.
My feeling is that the Pharo guys always sprint and never look back.
That runs the risk of repeating some mistake...

Re: TextConverter is broken

Nicolas Cellier
In other words, Max, if you correctly invoke

    MacRomanTextConverter initializeLatin1MapAndEncodings.

then the old converter recovers its good health.
Now that you know that, you can remove the old converters and keep the modernized Zn ones ;)

Re: TextConverter is broken

Sven Van Caekenberghe-2
In reply to this post by Nicolas Cellier

Yes, it was a bit cheap to say the old ones were broken; I just tried with the new ones and saw that there was no problem there (luckily ;-). Of course this little bug is fixable.

However, the old ones are bad/broken in many ways: semantics, API, behaviour, error handling, lack of functionality, hacked implementation.

Also, the new ones have been around for many years in Zn and have been used for any and all HTTP encoding/decoding in Pharo since 1.3.

Now, it is not easy to replace one with the other at the lower levels, due to the (intentional/by-design) API differences.
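
The difference in a nutshell, sticking to expressions already shown in this thread (the old converters answer Strings whose characters carry the encoded byte values, while the Zn encoders work with ByteArrays):

(Character value: 216) asString convertToEncoding: 'mac-roman'. " => a String containing (Character value: 175)"
#macroman asZnCharacterEncoder encodeString: (Character value: 216) asString. " => #[175]"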

The high-level view is well explained in the first part of the Character Encoding and Resource Meta Description chapter of the Enterprise Pharo book (http://files.pharo.org/books/enterprisepharo/book/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html).

Sven

Re: TextConverter is broken

Max Leske
In reply to this post by Nicolas Cellier
Thank you everyone for the explanations.

How would you feel about opening an issue for Pharo 6 to remove TextConverter?

On 03 Mar 2016, at 01:04, Nicolas Cellier <[hidden email]> wrote:

    MacRomanTextConverter initializeLatin1MapAndEncodings.

That does indeed fix the problem.


Re: TextConverter is broken

Max Leske
https://pharo.fogbugz.com/f/cases/17751/Remove-TextConverter


Re: TextConverter is broken

stepharo
Thank you all, and I love the bug entry.
Hi Nicolas, I understand your point, but you see, we are not sprinting blindly.
We are the guys who pushed SUnit tests into Squeak back then, and into Pharo as well.
We are pushing for Quality Rules and a lot more.
Now we are doing a lot of things, because the world is moving and we have a dream to make come true :)
And if you would commit to Pharo, that would help us too, because we have a long list of great changes:
    - Epicea
    - Bootstrap
    - Bloc
    - Xtreams
    -

Stef
