adding ISO-8859-15 and CP-1252 support


Philippe Marschall-2-3
Hi

I decided to write ISO-8859-15 and CP-1252 support [1] (mostly for
selfish reasons so that Seaside on Pharo would support ISO-8859-15 and
CP-1252).

A couple of notes:
 - the five unmapped bytes of CP-1252 (not ISO-8859-15, the comment is
wrong) are mapped to the Unicode replacement character (U+FFFD)
 - a new Latin9 language environment is introduced
 - some minor clean up like removing unused class variables
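
To make the first note concrete, here is a minimal sketch in Python rather than Pharo. It is not the code from the issue: Python's built-in "cp1252" codec stands in for the new converter, and the helper name decode_cp1252 is made up for illustration. The codec leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) without a mapping, and errors="replace" turns each of them into U+FFFD.

```python
# Sketch only: Python's built-in "cp1252" codec stands in for the Pharo
# converter discussed in this thread; decode_cp1252 is a hypothetical name.
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_cp1252(data: bytes) -> str:
    """Decode CP-1252, turning bytes that have no mapping into U+FFFD."""
    return data.decode("cp1252", errors="replace")


# Collect every byte value for which CP-1252 defines no character.
unmapped = [b for b in range(256) if decode_cp1252(bytes([b])) == REPLACEMENT]

print([hex(b) for b in unmapped])
print(decode_cp1252(b"caf\xe9 \x81"))
```

Whether a converter should substitute U+FFFD or signal an error for these bytes is a design choice; the sketch follows the substitution described in the note above.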

I'd appreciate it if somebody knowledgeable in these areas could review
the changes. I'm especially unsure about the Latin9 language
environment, but reusing Latin1 or Unicode seemed wrong.

 [1] http://code.google.com/p/pharo/issues/detail?id=2812

Cheers
Philippe


_______________________________________________
Pharo-project mailing list
[hidden email]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

Re: adding ISO-8859-15 and CP-1252 support

Philippe Marschall-2
On 08/16/2010 09:49 PM, Philippe Marschall wrote:

> Hi
>
> I decided to write ISO-8859-15 and CP-1252 support [1] (mostly for
> selfish reasons so that Seaside on Pharo would support ISO-8859-15 and
> CP-1252).
>
> A couple of notes:
>  - the five unmapped bytes of CP-1252 (not ISO-8859-15, the comment is
> wrong) are mapped to the Unicode replacement character (U+FFFD)
>  - a new Latin9 language environment is introduced

I also snatched the first free leading char (17) for this.

Cheers
Philippe



Re: adding ISO-8859-15 and CP-1252 support

Henrik Sperre Johansen
In reply to this post by Philippe Marschall-2-3

On Aug 16, 2010, at 9:49:30 PM, Philippe Marschall wrote:

> Hi
>
> I decided to write ISO-8859-15 and CP-1252 support [1] (mostly for
> selfish reasons so that Seaside on Pharo would support ISO-8859-15 and
> CP-1252).


More converters are always nice :D
Their code seems ok to me.

>
> A couple of notes:
> - the five unmapped bytes of CP-1252 (not ISO-8859-15, the comment is
> wrong) are mapped to the Unicode replacement character (U+FFFD)
> - a new Latin9 language environment is introduced
> - some minor clean up like removing unused class variables
>
> I'd appreciate it if somebody knowledgeable in these areas could review
> the changes. I'm especially unsure about the Latin9 language
> environment, but reusing Latin1 or Unicode seemed wrong.

I'm not sure it's too wrong, according to the EncodedCharSet class comment:
"The other confusion comes from the name of 'Latin1' class.  It used to mean the Latin-1 (ISO-8859-1) character set, but now it primarily means that the 'Western European languages that are covered by the characters in Latin-1 character set.'"
I'd reckon the same holds true for Latin1Environment (Western), Latin2Environment (Eastern), and Latin7Environment (Greek). I don't think CP1252/8859-15 warrant environments of their own, as they are basically alternative encodings to Latin-1 for Western languages.
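
To illustrate how close the three encodings are, here is a small Python sketch (not Pharo code; the helper names decode_byte and differing_bytes are made up for illustration). It uses Python's built-in codecs to list the byte values that each encoding decodes differently from ISO-8859-1. ISO-8859-15 differs in eight positions, and CP-1252 differs only in the 0x80-0x9F block.

```python
# Sketch only: compares Python's built-in codecs byte by byte.
# decode_byte and differing_bytes are hypothetical helper names.


def decode_byte(byte: int, encoding: str) -> str:
    """Decode one byte; bytes without a mapping become U+FFFD."""
    return bytes([byte]).decode(encoding, errors="replace")


def differing_bytes(encoding: str, baseline: str = "latin-1") -> list[int]:
    """Return the byte values that decode differently from the baseline."""
    return [
        b for b in range(256)
        if decode_byte(b, encoding) != decode_byte(b, baseline)
    ]


latin9_diff = differing_bytes("iso-8859-15")
cp1252_diff = differing_bytes("cp1252")

for b in latin9_diff:
    # Show each ISO-8859-15 position next to its ISO-8859-1 counterpart.
    print(hex(b), decode_byte(b, "latin-1"), decode_byte(b, "iso-8859-15"))
print(len(cp1252_diff), hex(cp1252_diff[0]), hex(cp1252_diff[-1]))
```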

Also:
- leadingChar is used in StrikeFontSet to choose between different glyph sets. This allows for StrikeFonts supporting more than the default Latin-1 glyphs, so it seems to me it would be "wrong" to use the same one for two different encodings.
Not sure why this approach was taken rather than allowing additional strike font sets based on Unicode code point ranges, then using leadingChar only to differentiate when the visual glyphs for those code points would be different. I suspect it may have been developed to deal with Han unification first, then reused to support multiple character sets later.

- LanguageEnvironment seems to have been used in conjunction with translation (note that the entire old translation system was removed from Pharo and replaced by an external package), maybe to decide which encoding externally stored translation files should be read in as.
Given that, having environments with overlapping supportedLanguages seems somewhat weird as well.
Modifying defaultEncodingName/systemConverterClass of Latin1Environment to use CP1252 on some Windows systems (as per Latin2) may be another approach. It may or may not lead to unintended consequences elsewhere, though; I did not investigate all uses.
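
For readers unfamiliar with leadingChar, the following Python sketch shows the general idea of tagging a code point with a small "leading char" that a font set could use to pick a glyph set. The bit layout is an assumption modelled on the Squeak-style scheme (leading char stored above the low 22 bits); the thread itself does not state the layout, and the function names are made up for illustration. The value 17 is the leading char mentioned earlier in the thread.

```python
# Sketch only: ASSUMES a layout where the leading char sits above the low
# 22 bits of the character value. pack, leading_char_of and code_point_of
# are hypothetical names, not Pharo selectors.
LEADING_CHAR_SHIFT = 22
CODE_POINT_MASK = (1 << LEADING_CHAR_SHIFT) - 1  # 0x3FFFFF
LEADING_CHAR_MAX = 0xFF


def pack(leading_char: int, code_point: int) -> int:
    """Combine a leading char and a code point into one integer value."""
    if not 0 <= leading_char <= LEADING_CHAR_MAX:
        raise ValueError("leading char out of range")
    if not 0 <= code_point <= CODE_POINT_MASK:
        raise ValueError("code point out of range")
    return (leading_char << LEADING_CHAR_SHIFT) | code_point


def leading_char_of(value: int) -> int:
    """Extract the leading char (the glyph-set selector) from a value."""
    return (value >> LEADING_CHAR_SHIFT) & LEADING_CHAR_MAX


def code_point_of(value: int) -> int:
    """Extract the code point from a value."""
    return value & CODE_POINT_MASK


# The euro sign tagged with leading char 17, versus leading char 0.
euro_latin9 = pack(17, 0x20AC)
euro_plain = pack(0, 0x20AC)
print(leading_char_of(euro_latin9), hex(code_point_of(euro_latin9)))
print(euro_latin9 == euro_plain)
```

Under this assumed layout the same code point with two different leading chars yields two different character values, which is why giving an alternative encoding its own leading char has consequences beyond font selection.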

IMHO, speaking as someone who wasn't involved in its development, the whole multilingual package could use some cleaning, more class comments, and a clearer statement of responsibilities.

Cheers,
Henry

TLDR;
More converters: yay!
More LanguageEnvironments: o_O, not sure

Re: adding ISO-8859-15 and CP-1252 support

Stéphane Ducasse
Henrik,

Thanks for the feedback.
Do you have any ideas for simple comments that could help?
Because this part of Pharo is just dark :)

Stef


Re: adding ISO-8859-15 and CP-1252 support

Philippe Marschall-2
In reply to this post by Henrik Sperre Johansen
On 08/17/2010 04:55 PM, Henrik Johansen wrote:

> TLDR;
> More converters: yay!
> More LanguageEnvironments: o_O, not sure

OK, if nobody says it's a good idea and the right thing to do, I'll drop
the LanguageEnvironment.

Cheers
Philippe



Re: adding ISO-8859-15 and CP-1252 support

Stéphane Ducasse
Maybe you should contact Yoshiki.
