adding ISO-8859-15 and CP-1252 support


Philippe Marschall-2-3

adding ISO-8859-15 and CP-1252 support

Hi

I decided to write ISO-8859-15 and CP-1252 support [1] (mostly for
selfish reasons so that Seaside on Pharo would support ISO-8859-15 and
CP-1252).

A couple of notes:
- the five unmapped bytes of CP-1252 (not ISO-8859-15, the comment is
wrong) are mapped to the Unicode replacement character (U+FFFD)
- a new Latin9 language environment is introduced
- some minor cleanup, like removing unused class variables
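
To make the first note concrete, here is a minimal sketch in Python rather than Pharo (it uses Python's built-in `cp1252` codec, not the converter from the issue). It lists the byte values that CP-1252 leaves undefined and shows them being decoded to U+FFFD. The helper name `decode_cp1252` is made up for this illustration.

```python
# Illustration only: Python's built-in cp1252 codec, not the Pharo converter.
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_cp1252(data: bytes) -> str:
    """Decode CP-1252, turning undefined bytes into U+FFFD instead of failing."""
    return data.decode("cp1252", errors="replace")


# Collect every byte value the codec has no mapping for.
unmapped = [b for b in range(256) if decode_cp1252(bytes([b])) == REPLACEMENT]

print([hex(b) for b in unmapped])
print(decode_cp1252(b"caf\xe9 \x80 \x81"))
```

Mapping the gaps to U+FFFD keeps decoding total: every input byte yields exactly one character, so a stray undefined byte cannot abort a request.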

I'd appreciate it if somebody knowledgeable in these areas could review
the changes. I'm especially unsure about the Latin9 language
environment, but reusing Latin1 or Unicode seemed wrong.

[1] http://code.google.com/p/pharo/issues/detail?id=2812

Cheers
Philippe

_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

Philippe Marschall-2

Re: adding ISO-8859-15 and CP-1252 support

On 08/16/2010 09:49 PM, Philippe Marschall wrote:

> Hi
>
> I decided to write ISO-8859-15 and CP-1252 support [1] (mostly for
> selfish reasons so that Seaside on Pharo would support ISO-8859-15 and
> CP-1252).
>
> A couple of notes:
> - the five unmapped bytes of CP-1252 (not ISO-8859-15, the comment is
> wrong) are mapped to the Unicode replacement character (U+FFFD)
> - a new Latin9 language environment is introduced

I also snatched the first free leading char (17) for this.

Cheers
Philippe


Henrik Sperre Johansen

Re: adding ISO-8859-15 and CP-1252 support

In reply to this post by Philippe Marschall-2-3

On Aug 16, 2010, at 9:49:30 PM, Philippe Marschall wrote:

> Hi
>
> I decided to write ISO-8859-15 and CP-1252 support [1] (mostly for
> selfish reasons so that Seaside on Pharo would support ISO-8859-15 and
> CP-1252).

More converters are always nice :D
Their code seems ok to me.

>
> A couple of notes:
> - the five unmapped bytes of CP-1252 (not ISO-8859-15, the comment is
> wrong) are mapped to the Unicode replacement character (U+FFFD)
> - a new Latin9 language environment is introduced
> - some minor clean up like removing unused class variables
>
> I'd appreciate it if somebody knowledgeable in these areas could review
> the changes. I'm especially unsure about the Latin9 language
> environment, but reusing Latin1 or Unicode seemed wrong.

I'm not sure it's too wrong. According to the EncodedCharSet class comment:
"The other confusion comes from the name of "Latin1" class. It used to mean the Latin-1 (ISO-8859-1) character set, but now it primarily means that the "Western European languages that are covered by the characters in Latin-1 character set."
I'd reckon the same holds true for Latin1Environment (Western), Latin2Environment (Eastern), and Latin7Environment (Greek). I don't think CP1252/8859-15 warrant environments of their own, as they are basically alternative encodings to Latin-1 for Western languages.
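
How close ISO-8859-15 is to Latin-1 can be checked with a short sketch. This is Python using its built-in codecs, not anything from the Pharo image, and `differing_bytes` is a name invented for the illustration. It decodes every byte value under both encodings and keeps the ones that disagree.

```python
from typing import Dict, Tuple


def differing_bytes(a: str, b: str) -> Dict[int, Tuple[str, str]]:
    """Return {byte: (char under a, char under b)} for bytes that decode differently."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        char_a, char_b = raw.decode(a), raw.decode(b)
        if char_a != char_b:
            diffs[value] = (char_a, char_b)
    return diffs


latin9_vs_latin1 = differing_bytes("iso8859-15", "latin-1")

for value, (latin9, latin1) in sorted(latin9_vs_latin1.items()):
    print(f"0x{value:02X}: Latin-9 U+{ord(latin9):04X}  Latin-1 U+{ord(latin1):04X}")
```

Only a handful of positions differ (the euro sign and a few accented letters replace rarely used symbols), which fits the view that these are alternative encodings for the same group of languages.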

Also:
- leadingChar is used in StrikeFontSet to choose between different glyph sets. This allows StrikeFonts to support more than the default Latin-1 glyphs, so it seems to me it would be "wrong" to use the same one for two different encodings.
I'm not sure why this approach was taken rather than allowing additional strike font sets based on Unicode code point ranges, and using leadingChar only to differentiate when the visual glyphs for those code points would be different. I suspect it was developed to deal with Han unification first, then reused to support multiple character sets later.

- LanguageEnvironment seems to have been used in conjunction with translation (note that the entire old translation system was removed from Pharo and replaced by an external package), perhaps to decide which encoding externally stored translation files should be read in.
Given that, having environments with overlapping supportedLanguages seems somewhat weird as well.
Modifying defaultEncodingName/systemConverterClass of Latin1Environment to use CP1252 on some Windows systems (as Latin2 does) may be another approach. It may or may not lead to unintended consequences elsewhere; I did not investigate all uses.
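
For readers unfamiliar with leadingChar, here is a rough sketch of the idea in Python. It assumes the Squeak-style layout in which a character value keeps the code point in its low 22 bits and the leading char in the bits above. That bit split is an assumption and is not stated in this thread, so check the Character class in your image before relying on it. The function names are invented, and 17 is the leading char mentioned earlier in the thread.

```python
CODE_BITS = 22                    # assumption: low 22 bits hold the code point
CODE_MASK = (1 << CODE_BITS) - 1  # 0x3FFFFF


def tag(leading_char: int, code_point: int) -> int:
    """Pack a leading char and a code point into one integer character value."""
    if not 0 <= code_point <= CODE_MASK:
        raise ValueError("code point does not fit in the low bits")
    if not 0 <= leading_char <= 0xFF:
        raise ValueError("leading char out of range")
    return (leading_char << CODE_BITS) | code_point


def leading_char(value: int) -> int:
    """Recover the tag that a font set could use to pick a glyph table."""
    return value >> CODE_BITS


def char_code(value: int) -> int:
    """Recover the plain code point."""
    return value & CODE_MASK


euro_untagged = tag(0, 0x20AC)
euro_latin9 = tag(17, 0x20AC)
print(hex(euro_untagged), hex(euro_latin9))
print(leading_char(euro_latin9), hex(char_code(euro_latin9)))
```

Under this scheme the same code point with two different tags gives two distinct character values, which is why adding a leading char for what is only another encoding of the same glyphs is debatable.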

IMHO, speaking as someone who wasn't involved in its development, the whole multilingual package could use some cleaning, more class comments, and a clearer statement of responsibilities.

Cheers,
Henry

TL;DR:
More converters: yay!
More LanguageEnvironments: o_O, not sure

Stéphane Ducasse

Re: adding ISO-8859-15 and CP-1252 support

Henrik,

thanks for the feedback.
Do you have any ideas for simple comments that could help?
Because this part of Pharo is just dark :)

Stef

On Aug 17, 2010, at 4:55 PM, Henrik Johansen wrote:


Philippe Marschall-2

Re: adding ISO-8859-15 and CP-1252 support

In reply to this post by Henrik Sperre Johansen

On 08/17/2010 04:55 PM, Henrik Johansen wrote:

OK, if nobody says it's a good idea and the right thing to do, I'll drop
the LanguageEnvironment.

Cheers
Philippe


Stéphane Ducasse

Re: adding ISO-8859-15 and CP-1252 support

Maybe you should contact Yoshiki.

On Aug 18, 2010, at 9:59 AM, Philippe Marschall wrote:
